WaveNet: A Generative Model for Raw Audio

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
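The phrase "each audio sample conditioned on all previous ones" corresponds to the standard autoregressive factorisation of a joint distribution. As a sketch, writing the waveform as x = (x_1, ..., x_T) (symbols chosen here for illustration, since the abstract itself uses none):

```latex
\[
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
\]
```

Each factor is the predictive distribution for one sample given everything before it, so generation proceeds one sample at a time.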

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
{avdnoord, sedielem, heigazen, simonyan, vinyals, gravesa, nalk, andrewsenior, korayk}@google.com
Google DeepMind

Download the full paper here

Facebook AI Research – Machine Intelligence Roadmap

Abstract

The development of intelligent machines is one of the biggest unsolved challenges in computer science. In this paper, we propose some fundamental properties these machines should have, focusing in particular on communication and learning. We discuss a simple environment that could be used to incrementally teach a machine the basics of natural-language-based communication, as a prerequisite to more complex interaction with human users. We also present some conjectures on the sort of algorithms the machine should support in order to profitably learn from the environment.

Tomas Mikolov, Armand Joulin, Marco Baroni
Facebook AI Research

1 Introduction

A machine capable of performing complex tasks without requiring laborious programming would be tremendously useful in almost any human endeavour, from performing menial jobs for us to helping the advancement of basic and applied research. Given the current availability of powerful hardware and large amounts of machine-readable data, as well as the widespread interest in sophisticated machine learning methods, the times should be ripe for the development of intelligent machines.

We think that one fundamental reason for this is that, since “solving AI” seems too complex a task to be pursued all at once, the computational community has preferred to focus, in the last decades, on solving relatively narrow empirical problems that are important for specific applications, but do not address the overarching goal of developing general-purpose intelligent machines.

In this article, we propose an alternative approach: we first define the general characteristics we think intelligent machines should possess, and then we present a concrete roadmap to develop them in realistic, small steps, that are however incrementally structured in such a way that, jointly, they should lead us close to the ultimate goal of implementing a powerful AI. We realise that our vision of artificial intelligence and how to create it is just one among many. We focus here on a plan that, we hope, will lead to genuine progress, without by this implying that there are not other valid approaches to the task.

The article is structured as follows. In Section 2 we indicate the two fundamental characteristics that we consider crucial for developing intelligence – at least the sort of intelligence we are interested in – namely communication and learning. Our goal is to build a machine that can learn new concepts through communication at a similar rate as a human with similar prior knowledge. That is, if one can easily learn how subtraction works after mastering addition, the intelligent machine, after grasping the concept of addition, should not find it difficult to learn subtraction as well.

Since, as we said, achieving the long-term goal of building an intelligent machine equipped with the desired features at once seems too difficult, we need to define intermediate targets that can lead us in the right direction. We specify such targets in terms of simplified but self-contained versions of the final machine we want to develop. Our plan is to “educate” the target machine like a child: At any time in its development, the target machine should act like a stand-alone intelligent system, albeit one that will be initially very limited in what it can do. The bulk of our proposal (Section 3) thus consists in the plan for an interactive learning environment fostering the incremental development of progressively more intelligent behaviour.

Section 4 briefly discusses some of the algorithmic capabilities we think a machine should possess in order to profitably exploit the learning environment. Finally, Section 5 situates our proposal in the broader context of past and current attempts to develop intelligent machines.

Download the full paper here

The Lovelace 2.0 Test of Artificial Creativity and Intelligence

The Lovelace 2.0 Test asks whether a computer can create an artefact – a poem, story, painting or architectural design – that expert and unbiased observers would conclude was designed by a human.

Prof. Mark Riedl proposes the concept of artificial creativity, akin to artificial intelligence. This could be tested, he says, using an alternative to the Turing Test, the AI benchmark that asserts that if a computer system can fool a human being into thinking it is human itself, then it can be said to be truly intelligent.

You can read Prof. Riedl’s paper below:

Abstract

Observing that the creation of certain types of artistic artefacts necessitates intelligence, we present the Lovelace 2.0 Test of creativity as an alternative to the Turing Test as a means of determining whether an agent is intelligent.

The Lovelace 2.0 Test builds off prior tests of creativity and additionally provides a means of directly comparing the relative intelligence of different agents.

Mark O. Riedl
School of Interactive Computing; Georgia Institute of Technology
riedl@cc.gatech.edu

Download the full paper here

Alternative structures for character-level RNNs

Abstract

Recurrent neural networks are convenient and efficient models for language modeling. However, when applied on the level of characters instead of words, they suffer from several problems. In order to successfully model long-term dependencies, the hidden representation needs to be large. This in turn implies higher computational costs, which can become prohibitive in practice. We propose two alternative structural modifications to the classical RNN model. The first one consists in conditioning the character-level representation on the previous word representation. The other one uses the character history to condition the output probability. We evaluate the performance of the two proposed modifications on challenging, multi-lingual real world data.

Piotr Bojanowski
INRIA
Paris, France
piotr.bojanowski@inria.fr

Armand Joulin and Tomas Mikolov
Facebook AI Research
New York, NY, USA
{tmikolov, ajoulin}@fb.com

Download the full paper here

Deep Neural Networks for Acoustic Modeling in Speech Recognition

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output.

Deep neural networks that have many hidden layers and are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
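The feed-forward alternative described above (several frames of coefficients in, posterior probabilities over HMM states out) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names, the two-frame context on each side, the logistic hidden units, the layer sizes and the random, untrained weights. It shows only the shape of the computation, not the systems reviewed in the paper.

```python
import math
import random

def stack_context(frames, t, left=2, right=2):
    """Concatenate frames t-left..t+right, repeating edge frames as padding."""
    window = []
    for k in range(t - left, t + right + 1):
        k = min(max(k, 0), len(frames) - 1)  # clamp index at the utterance edges
        window.extend(frames[k])
    return window

def affine(x, weights, bias):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax: outputs are positive and sum to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def state_posteriors(frames, t, layers):
    """Forward pass: logistic hidden layers, softmax output over HMM states."""
    h = stack_context(frames, t)
    for weights, bias in layers[:-1]:
        h = [1.0 / (1.0 + math.exp(-a)) for a in affine(h, weights, bias)]
    weights, bias = layers[-1]
    return softmax(affine(h, weights, bias))

def random_layer(rng, n_in, n_out):
    """Hypothetical helper: small random weights, zero biases (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Toy input: 6 frames of 3 made-up coefficients each (not real acoustic features).
frames = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]
# 5 stacked frames * 3 coefficients = 15 inputs -> 8 hidden units -> 4 "HMM states".
layers = [random_layer(rng, 15, 8), random_layer(rng, 8, 4)]
posteriors = state_posteriors(frames, t=0, layers=layers)
print(len(posteriors), round(sum(posteriors), 6))
```

In a real system the output layer would have thousands of units, one per tied HMM state, and the weights would come from training rather than a random generator.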

INTRODUCTION

New machine learning algorithms can lead to significant advances in automatic speech recognition. The biggest single advance occurred nearly four decades ago with the introduction of the Expectation-Maximization (EM) algorithm for training Hidden Markov Models (HMMs). With the EM algorithm, it became possible to develop speech recognition systems for real world tasks using the richness of Gaussian mixture models (GMM) to represent the relationship between HMM states and the acoustic input. In these systems the acoustic input is typically represented by concatenating Mel Frequency Cepstral Coefficients (MFCCs) or Perceptual Linear Predictive coefficients (PLPs) computed from the raw waveform, and their first- and second-order temporal differences. This non-adaptive but highly-engineered pre-processing of the waveform is designed to discard the large amount of information in waveforms that is considered to be irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
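To make "concatenating coefficients and their first- and second-order temporal differences" concrete, here is a small sketch. It assumes plain finite differences (central in the interior, one-sided at the edges); production front-ends often use a regression over a wider window instead, and the function names and toy values below are illustrative only.

```python
def differences(frames):
    """Temporal differences: central in the interior, one-sided at the edges."""
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames to compute differences")
    out = []
    for t in range(n):
        prev_t, next_t = max(t - 1, 0), min(t + 1, n - 1)
        span = next_t - prev_t  # 1 at the edges, 2 in the interior
        out.append([(b - a) / span for a, b in zip(frames[prev_t], frames[next_t])])
    return out

def add_deltas(frames):
    """Append first- and second-order differences to every frame."""
    delta = differences(frames)
    delta2 = differences(delta)
    return [f + d + dd for f, d, dd in zip(frames, delta, delta2)]

# Toy input: 4 frames of 2 made-up coefficients (not real MFCCs).
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
augmented = add_deltas(frames)
print(len(augmented[0]))  # each frame now has 3 times as many values
```

The first coefficient rises by one per frame, so its first difference is 1 and its second difference is 0; the second coefficient is constant, so both of its differences are 0.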

GMMs have a number of advantages that make them suitable for modeling the probability distributions over vectors of input features that are associated with each state of an HMM. With enough components, they can model probability distributions to any required level of accuracy and they are fairly easy to fit to data using the EM algorithm. A huge amount of research has gone into ways of constraining GMMs to increase their evaluation speed and to optimize the trade-off between their flexibility and the amount of training data available to avoid serious overfitting.
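As an illustration of how a GMM scores a feature vector, the sketch below evaluates the log-likelihood under a mixture of diagonal-covariance Gaussians. It covers evaluation only, not the EM fitting mentioned above, and the function name and toy parameters are assumptions made for the example.

```python
import math

def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # log of (mixture weight * product of independent 1-D Gaussians)
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # log-sum-exp over components for numerical stability
    m = max(component_logs)
    return m + math.log(sum(math.exp(c - m) for c in component_logs))

# Toy model: two components in two dimensions (made-up parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near = diag_gmm_log_likelihood([0.1, -0.1], weights, means, variances)
far = diag_gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)
```

A point close to one of the component means scores higher than a point far from both, which is the sense in which a GMM measures how well an HMM state "fits" a frame.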

The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data, especially if the discriminative objective function used for training is closely related to the error rate on phones, words or sentences [7]. The accuracy can also be improved by augmenting (or concatenating) the input features (e.g., MFCCs) with “tandem” or bottleneck features generated using neural networks. GMMs are so successful that it is difficult for any new method to outperform them for acoustic modeling.

Despite all their advantages, GMMs have a serious shortcoming – they are statistically inefficient for modeling data that lie on or near a non-linear manifold in the data space. For example, modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians. Speech is produced by modulating a relatively small number of parameters of a dynamical system [10], [11] and this implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. We believe, therefore, that other types of model may work better than GMMs for acoustic modeling if they can more effectively exploit information embedded in a large window of frames.

Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a non-linear manifold. In fact, two decades ago, researchers achieved some success using artificial neural networks with a single layer of non-linear hidden units to predict HMM states from windows of acoustic coefficients. At that time, however, neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data, and the performance benefits of using neural networks with a single hidden layer were not sufficiently large to seriously challenge GMMs. As a result, the main practical contribution of neural networks at that time was to provide extra features in tandem or bottleneck systems.

Over the last few years, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (DNNs) that contain many layers of non-linear hidden units and a very large output layer. The large output layer is required to accommodate the large number of HMM states that arise when each phone is modelled by a number of different “triphone” HMMs that take into account the phones on either side. Even when many of the states of these triphone HMMs are tied together, there can be thousands of tied states. Using the new learning methods, several different research groups have shown that DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of datasets including large datasets with large vocabularies.

This review paper aims to represent the shared views of research groups at the University of Toronto, Microsoft Research (MSR), Google and IBM Research, who have all had recent successes in using DNNs for acoustic modeling. The paper starts by describing the two-stage training procedure that is used for fitting the DNNs. In the first stage, layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate. In the second stage, each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.

In this paper we review exploratory experiments on the TIMIT database that were used to demonstrate the power of this two-stage training procedure for acoustic modeling. The DNNs that worked well on TIMIT were then applied to five different large vocabulary, continuous speech recognition tasks by three different research groups whose results we also summarize. The DNNs worked well on all of these tasks when compared with highly-tuned GMM-HMM systems and on some of the tasks they outperformed the state-of-the-art by a large margin. We also describe some other uses of DNNs for acoustic modeling and some variations on the training procedure.

View the full PDF publication here.

[Geoffrey Hinton, Li Deng, Dong Yu, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, George Dahl, and Brian Kingsbury]

The Singularity: A Philosophical Analysis

David Chalmers is a leading philosopher of mind, and the first to publish a major philosophy journal article on the singularity:

Chalmers, D. (2010). “The Singularity: A Philosophical Analysis.” Journal of Consciousness Studies 17:7-65.

Chalmers’ article is a “survey” article in that it doesn’t cover any arguments in depth, but quickly surveys a large number of positions and arguments in order to give the reader a “lay of the land.” Because of this, Chalmers’ paper is a remarkably broad and clear introduction to the singularity.

Singularitarian authors will also be pleased that they can now cite a peer-reviewed article by a leading philosopher of mind who takes the singularity seriously.

Below is a CliffsNotes of the paper for those who don’t have time to read all 58 pages of it.

The Singularity: Is It Likely?

Chalmers focuses on the “intelligence explosion” kind of singularity, and his first project is to formalise and defend I.J. Good’s 1965 argument. Defining AI as being “of human level intelligence,” AI+ as AI “of greater than human level” and AI++ as “AI of far greater than human level” (super intelligence), Chalmers updates Good’s argument to the following:

  1. There will be AI (before long, absent defeaters).
  2. If there is AI, there will be AI+ (soon after, absent defeaters).
  3. If there is AI+, there will be AI++ (soon after, absent defeaters).
  4. Therefore, there will be AI++ (before too long, absent defeaters).
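The logical skeleton of the argument above is two applications of modus ponens. As a sketch, let A, A^+ and A^{++} abbreviate "there will be AI", "there will be AI+" and "there will be AI++" (shorthand introduced here for illustration; the timing qualifiers and the "absent defeaters" clause are left out):

```latex
\[
\frac{A \qquad A \rightarrow A^{+} \qquad A^{+} \rightarrow A^{++}}{A^{++}}
\]
```

The argument is therefore valid in form, and the discussion that follows concerns whether each premise is true.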

By “defeaters,” Chalmers means global catastrophes like nuclear war or a major asteroid impact. One way to satisfy premise (1) is to achieve AI through brain emulation (Sandberg & Bostrom, 2008). Against this suggestion, Lucas (1961), Dreyfus (1972), and Penrose (1994) argue that human cognition is not the sort of thing that could be emulated. Chalmers (1995; 1996, chapter 9) has responded to these criticisms at length. Briefly, Chalmers notes that even if the brain is not a rule-following algorithmic symbol system, we can still emulate it if it is mechanical. (Some say the brain is not mechanical, but Chalmers dismisses this as being discordant with the evidence.)

Searle (1980) and Block (1981) argue instead that even if we can emulate the human brain, it doesn’t follow that the emulation is intelligent or has a mind. Chalmers says we can set these concerns aside by stipulating that when discussing the singularity, AI need only be measured in terms of behavior. The conclusion that there will be AI++ at least in this sense would still be massively important.

Another consideration in favor of premise (1) is that evolution produced human-level intelligence, so we should be able to build it, too. Perhaps we will even achieve human-level AI by evolving a population of dumber AIs through variation and selection in virtual worlds. We might also achieve human-level AI by direct programming or, more likely, systems of machine learning.

Premise (2) is plausible because AI will probably be produced by an extendible method, and so extending that method will yield AI+. Brain emulation might turn out not to be extendible, but the other methods are. Even if human-level AI is first created by a non-extendible method, this method itself would soon lead to an extendible method, and in turn enable AI+. AI+ could also be achieved by direct brain enhancement.

Premise (3) is the amplification argument from Good: an AI+ would be better than we are at designing intelligent machines, and could thus improve its own intelligence. Having done that, it would be even better at improving its intelligence. And so on, in a rapid explosion of intelligence.

In section 3 of his paper, Chalmers argues that there could be an intelligence explosion without there being such a thing as “general intelligence” that could be measured, but I won’t cover that here.

In section 4, Chalmers lists several possible obstacles to the singularity.

Constraining AI

Next, Chalmers considers how we might design an AI+ that helps to create a desirable future and not a horrifying one. If we achieve AI+ by extending the method of human brain emulation, the AI+ will at least begin with something like our values. Directly programming friendly values into an AI+ (Yudkowsky, 2004) might also be feasible, though an AI+ arrived at by evolutionary algorithms is worrying.

Most of this assumes that values are independent of intelligence, as Hume argued. But if Hume was wrong and Kant was right, then we will be less able to constrain the values of a superintelligent machine, but the more rational the machine is, the better values it will have.

Another way to constrain an AI is not internal but external. For example, we could lock it in a virtual world from which it could not escape, and in this way create a leakproof singularity. But there is a problem. For the AI to be of use to us, some information must leak out of the virtual world for us to observe it. But then, the singularity is not leakproof. And if the AI can communicate with us, it could reverse-engineer human psychology from within its virtual world and persuade us to let it out of its box – into the internet, for example.

Our Place in a Post-Singularity World

Chalmers says there are four options for us in a post-singularity world: extinction, isolation, inferiority, and integration.

The first option is undesirable. The second option would keep us isolated from the AI, a kind of technological isolationism in which one world is blind to progress in the other. The third option may be infeasible because an AI++ would operate so much faster than us that inferiority is only a blink of time on the way to extinction.

For the fourth option to work, we would need to become superintelligent machines ourselves. One path to this might be mind uploading, which comes in several varieties and has implications for our notions of consciousness and personal identity that Chalmers discusses but I will not. (Short story: Chalmers prefers gradual uploading, and considers it a form of survival.)

Conclusion

Chalmers concludes:

Will there be a singularity? I think that it is certainly not out of the question, and that the main obstacles are likely to be obstacles of motivation rather than obstacles of capacity.

How should we negotiate the singularity? Very carefully, by building appropriate values into machines, and by building the first AI and AI+ systems in virtual worlds.

How can we integrate into a post-singularity world? By gradual uploading followed by enhancement if we are still around then, and by reconstructive uploading followed by enhancement if we are not.

References

Block (1981). “Psychologism and behaviorism.” Philosophical Review 90:5-43.

Chalmers (1995). “Minds, machines, and mathematics.” Psyche 2:11-20.

Chalmers (1996). The Conscious Mind. Oxford University Press.

Dreyfus (1972). What Computers Can’t Do. Harper & Row.

Lucas (1961). “Minds, machines, and Gödel.” Philosophy 36:112-27.

Penrose (1994). Shadows of the Mind. Oxford University Press.

Sandberg & Bostrom (2008). “Whole brain emulation: A roadmap.” Technical report 2008-3, Future of Humanity Institute, Oxford University.

Searle (1980). “Minds, brains, and programs.” Behavioral and Brain Sciences 3:417-57.

Yudkowsky (2004). “Coherent Extrapolated Volition.”