The Neural Network That Remembers

It’s pretty drinkable, but I wouldn’t mind if this beer was available.” Besides the overpowering bouquet of raspberries in this guy’s beer, this review is remarkable for another reason.

It was produced by a computer program instructed to hallucinate a review for a “fruit/vegetable beer.” Using a powerful artificial-intelligence tool called a recurrent neural network, the software that produced this passage isn’t even programmed to know what words are, much less to obey the rules of English syntax.

The neural network learns proper nouns like “Coors Light” and beer jargon like “lacing” and “snifter.” It learns to spell and to misspell, and to ramble just the right amount.

It knows to describe India pale ales as “hoppy,” stouts as “chocolatey,” and American lagers as “watery.” The neural network also learns more colorful words for lagers that we can’t put in print.

This particular neural network can also run in reverse, taking any review and recognizing the sentiment (star rating) and subject (type of beer).

Envisioning what has since become known as a Turing test, he proposed that if the computer could imitate a person so convincingly as to fool a human judge, you could reasonably deem it to be intelligent.

Asimov’s tales, written before the phrase “artificial intelligence” existed, feature cunning robots engaging in conversations, piloting vehicles, and even helping to govern society.

On the other side, more practically oriented researchers apply machine learning to various real-world tasks, guided more by experimentation than by mathematical theory.

However, breakthroughs in neural-network research have revolutionized computer vision and natural-language processing, rekindling the imaginations of the public, researchers, and industry.

That’s because, until recently, machine learning was dominated by methods with well-understood theoretical properties, whereas neural-network research relies more on experimentation.

Nevertheless, the capabilities of recurrent neural networks are undeniable and potentially open the door to the kinds of deeply interactive systems people have hoped for—or feared—for generations.

Neurons, as you might recall from high school biology class, are cells that fire off electrical signals or refrain from doing so depending on signals received from the other neurons attached to them.

To determine the intensity of an artificial neuron’s firing or, more properly, its activation, we calculate a weighted sum of the activations of all the neurons that feed into it.

To dodge this problem entirely and to simplify the computations, we typically arrange the neurons in layers, with each neuron in a layer connected to the neurons in the layer above, making for many more connections than neurons.

The ultimate output of the network (say, a categorization of the input image as depicting a cat, dog, or person) is read from the activations of the artificial neurons in the very top layer.
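The layered arrangement described above can be sketched in a few lines of Python. This is only an illustration: the sigmoid squashing function, the layer sizes, the weight values, and the names `layer_forward` and `network_forward` are all assumptions made for the example, not details taken from the article.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the layer below, then squashes it."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Pass activations upward through every layer and return the top layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs, 3 hidden neurons, 3 outputs, made-up weights.
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5], [-0.5, 0.9, 0.2], [0.3, 0.3, -0.8]], [0.0, 0.0, 0.0]),
]
labels = ["cat", "dog", "person"]

output = network_forward([0.9, 0.4], layers)
# The category is read off the most active neuron in the top layer.
prediction = labels[output.index(max(output))]
```

With untrained, made-up weights the prediction is meaningless; the point is only the flow of activations from one layer to the next.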

The hard part is training your neural network to produce something useful, which is to say, tinkering with the (perhaps millions of) weights corresponding to the connections between the artificial neurons.

That is, you repeatedly adjust the connection weights a small amount, bringing the output of the network incrementally closer to the ground truth.

While determining the correct updates to each of the weights can be tricky, we can calculate them efficiently with a well-known technique called backpropagation, which was developed roughly 30 years ago by David Rumelhart, Geoff Hinton, and Ronald Williams.
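To make the idea of small, repeated weight adjustments concrete, here is a minimal sketch for the simplest possible case: a single sigmoid neuron trained by gradient descent on a squared error. The training data, the learning rate, and the function name `train_neuron` are assumptions for illustration. Full backpropagation applies the same chain-rule step layer by layer through a whole network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, learning_rate=0.5, epochs=2000):
    """Nudge one weight and one bias a little after every example."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = sigmoid(w * x + b)
            # Chain rule for E = 0.5 * (y - target)^2:
            # dE/dz = (y - target) * y * (1 - y), where z = w * x + b.
            delta = (y - target) * y * (1.0 - y)
            w -= learning_rate * delta * x  # dE/dw = delta * x
            b -= learning_rate * delta      # dE/db = delta
    return w, b

# Toy ground truth (made up): negative inputs map to 0, positive inputs to 1.
samples = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b = train_neuron(samples)
```

After training, `sigmoid(w * x + b)` should be close to 1 for the positive examples and close to 0 for the negative ones.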

Early on, computer scientists built neural networks with just three tiers: the input layer, a single hidden layer, and the output layer.

At step 2, however, the hidden layers also receive activation flowing across time from the corresponding hidden layers from step 1.

Computer scientists have known for decades that recurrent neural networks are powerful tools, capable of performing any well-defined computational procedure.

It’s true, but there’s a huge gap between knowing that your tool can in theory be used to write some desired program and knowing exactly how to construct it.

We won’t go too far into the weeds describing memory cells here, but the basic idea is to provide the network with memory that persists longer than the immediately forgotten activations of simple artificial neurons.

Memory cells give the network a form of medium-term memory, in contrast to the ephemeral activations of a feed-forward net or the long-term knowledge recorded in the settings of the weights.

In collaboration with David Kale of the University of Southern California and Randall Wetzell of Children’s Hospital Los Angeles, we devised a recurrent neural network that could make diagnoses after processing sequences of observations taken in the hospital’s pediatric intensive-care unit.

The sequences consisted of 13 frequently but irregularly sampled clinical measurements, including heart rate, blood pressure, blood glucose levels, and measures of respiratory function.

The network proved able to recognize diverse conditions such as brain cancer, status asthmaticus (unrelenting asthma attacks), and diabetic ketoacidosis (a serious complication of diabetes where the body produces excess blood acids) with remarkable accuracy.

The promising results from our medical application demonstrate the power of recurrent neural networks to capture the meaningful signal in sequential data.

Success in this context really means getting someone to declare, “There’s no way a computer wrote that!” In this sense, the computer-science community is evaluating recurrent neural networks via a kind of Turing test.

Recently, researchers at Google DeepMind combined reinforcement learning with feed-forward neural networks to create a system that can beat human players at 31 different video games.

Recurrent neural network

A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence.

The term 'recurrent neural network' is used indiscriminately to refer to two broad classes of networks with a similar general structure, where one is finite impulse and the other is infinite impulse.

A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.

Both finite impulse and infinite impulse recurrent networks can have additional stored state, and the storage can be under direct control by the neural network.

Such controlled states are referred to as gated state or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units.
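As a rough illustration of gated state, the sketch below implements one time step of a single-unit LSTM using the commonly published gate equations (input, forget, and output gates plus a candidate value). The scalar formulation, the parameter values, and the name `lstm_step` are assumptions chosen to keep the example short; real LSTM layers use vectors and weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # proposed new content
    c = f * c_prev + i * g           # gated memory update
    h = o * math.tanh(c)             # gated, squashed output
    return h, c

# Made-up parameters: the forget gate is biased toward 1, so the stored
# value fades only slowly once the input goes quiet.
params = {
    "input": (1.0, 0.0, 0.0),
    "forget": (0.0, 0.0, 4.0),
    "output": (0.0, 0.0, 4.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
trace = []
for x in [2.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    trace.append(c)
```

Here `trace` records the cell state after each step: it jumps when the nonzero input arrives and then decays only slightly, which is the persistence that the gates control.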

In 1993, a neural history compressor system solved a 'Very Deep Learning' task that required more than 1000 subsequent layers in an RNN unfolded in time.[5]

In 2014, the Chinese search giant Baidu used CTC-trained RNNs to break the Switchboard Hub5'00 speech recognition benchmark without using any traditional speech processing methods.[10]

Basic RNNs are a network of neuron-like nodes organized into successive 'layers'. Each node in a given layer is connected with a directed (one-way) connection to every other node in the next successive layer.

Nodes are either input nodes (receiving data from outside the network), output nodes (yielding results), or hidden nodes (that modify the data en route from input to output).

For supervised learning in discrete time settings, sequences of real-valued input vectors arrive at the input nodes, one vector at a time.

At any given time step, each non-input unit computes its current activation (result) as a nonlinear function of the weighted sum of the activations of all units that connect to it.

For example, if the input sequence is a speech signal corresponding to a spoken digit, the final target output at the end of the sequence may be a label classifying the digit.

Instead a fitness function or reward function is occasionally used to evaluate the RNN's performance, which influences its input stream through output units connected to actuators that affect the environment.

An Elman network is a three-layer network (arranged horizontally as x, y, and z in the illustration) with the addition of a set of 'context units' (u in the illustration).

The fixed back-connections save a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied).

Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard multilayer perceptron.

Each neuron in one layer only receives its own past state as context information (instead of full connectivity to all other neurons in this layer) and thus neurons are independent of each other's history.

Given a lot of learnable predictability in the incoming data sequence, the highest level RNN can use supervised learning to easily classify even deep sequences with long intervals between important events.

Once the chunker has learned to predict and compress inputs that are unpredictable by the automatizer, then the automatizer can be forced in the next learning phase to predict or imitate through additional units the hidden units of the more slowly changing chunker.

LSTM works even given long delays between significant events and can handle signals that mix low and high frequency components.

Connectionist temporal classification (CTC) training aims to find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences.

A continuous-time recurrent neural network (CTRNN) uses a system of ordinary differential equations to model the effects on a neuron of the incoming spike train.

Note that, by the Shannon sampling theorem, discrete time recurrent neural networks can be viewed as continuous-time recurrent neural networks where the differential equations have transformed into equivalent difference equations.
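One simple way to see how a differential equation becomes a difference equation is forward Euler discretization. The sketch below assumes the commonly cited CTRNN form, tau_i * dy_i/dt = -y_i + sum_j w_ji * sigmoid(y_j - theta_j) + I_i(t), which is not spelled out in the text above. The function name `ctrnn_euler_step` and the test values are hypothetical, and Euler stepping is only one of several possible discretizations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ctrnn_euler_step(y, weights, tau, theta, inputs, dt):
    """Advance every neuron state by one Euler step of size dt.

    weights[j][i] is the connection strength from neuron j to neuron i.
    """
    n = len(y)
    firing = [sigmoid(y[j] - theta[j]) for j in range(n)]
    new_y = []
    for i in range(n):
        incoming = sum(weights[j][i] * firing[j] for j in range(n))
        dy_dt = (-y[i] + incoming + inputs[i]) / tau[i]
        new_y.append(y[i] + dt * dy_dt)
    return new_y

# Sanity check on a single unconnected neuron with constant input 1.0:
# the continuous equation settles at y = 1, and so should the discrete one.
state = [0.0]
for _ in range(200):
    state = ctrnn_euler_step(state, [[0.0]], [1.0], [0.0], [1.0], dt=0.1)
```

A smaller `dt` tracks the continuous dynamics more closely at the cost of more steps per unit of simulated time.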

A multiple timescales recurrent neural network (MTRNN) is a neural-based computational model that can simulate the functional hierarchy of the brain through self-organization that depends on spatial connection between neurons and on distinct types of neuron activities, each with distinct time properties.[53][54]

With such varied neuronal activities, continuous sequences of any set of behaviors are segmented into reusable primitives, which in turn are flexibly integrated into diverse sequential behaviors.

In neural networks, gradient descent can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-linear activation functions are differentiable.

In this context, local in space means that a unit's weight vector can be updated using only information stored in the connected units and the unit itself such that update complexity of a single unit is linear in the dimensionality of the weight vector.

Local in time means that the updates take place continually (on-line) and depend only on the most recent time step rather than on multiple time steps within a given time horizon as in BPTT.

For recursively computing the partial derivatives, RTRL has a time-complexity of O(number of hidden x number of weights) per time step for computing the Jacobian matrices, while BPTT only takes O(number of weights) per time step, at the cost of storing all forward activations within the given time horizon.[63]

A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events.[34][67]
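The exponential decay can be observed directly in a toy setting. The sketch below assumes a one-neuron RNN, h_t = tanh(w * h_{t-1} + x_t), with a made-up recurrent weight and constant input, and tracks how strongly the state at step t still depends on the initial state. The name `gradient_through_time` is hypothetical.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = h0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step multiplies the gradient by w * tanh'(pre-activation),
        # and tanh'(z) = 1 - tanh(z)^2 is never larger than 1.
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

# Made-up settings: recurrent weight 0.9, the same input at all 50 steps.
magnitudes = gradient_through_time(w=0.9, inputs=[0.5] * 50)
```

Because every factor here has magnitude at most 0.9, the entries of `magnitudes` shrink at least geometrically with the time lag, so an error signal from 50 steps later barely reaches the earliest step.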

This fact improves stability of the algorithm, providing a unifying view on gradient calculation techniques for recurrent networks with local feedback.

A target function can be formed to evaluate the fitness or error of a particular weight vector as follows: First, the weights in the network are set according to the weight vector.

Initially, the genetic algorithm is encoded with the neural network weights in a predefined manner where one gene in the chromosome represents one weight link. The whole network is represented as a single chromosome.

Other global (and/or evolutionary) optimization techniques may be used to seek a good set of weights, such as simulated annealing or particle swarm optimization.
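The chromosome encoding described above can be sketched as follows. This is a deliberately simplified evolutionary loop (truncation selection, one-point crossover, Gaussian mutation), not any specific published algorithm. The tiny feed-forward network, the sine-fitting task, and all names and parameter values are assumptions for illustration.

```python
import math
import random

def network_output(chromosome, x):
    """Tiny 1-2-1 network; the chromosome is the flat list of its 7 weights."""
    w1, b1, w2, b2, v1, v2, c = chromosome
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * x + b2)
    return v1 * h1 + v2 * h2 + c

def error(chromosome, data):
    """Target function: mean squared error of the network these weights define."""
    return sum((network_output(chromosome, x) - t) ** 2 for x, t in data) / len(data)

def evolve(data, population_size=20, generations=50, mutation_scale=0.3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(7)] for _ in range(population_size)]
    best_errors = []
    for _ in range(generations):
        population.sort(key=lambda ch: error(ch, data))
        best_errors.append(error(population[0], data))
        parents = population[: population_size // 2]  # the fitter half survives unchanged
        children = []
        for _ in range(population_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 7)                 # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([gene + rng.gauss(0, mutation_scale) for gene in child])
        population = parents + children
    population.sort(key=lambda ch: error(ch, data))
    return population[0], best_errors

# Hypothetical task: fit sin(x) on a few points between -2 and 2.
data = [(x / 4, math.sin(x / 4)) for x in range(-8, 9)]
best, best_errors = evolve(data)
final_error = error(best, data)
```

Because the fitter half of each generation is carried over unchanged, the best error recorded in `best_errors` can never get worse from one generation to the next.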

Whereas recursive neural networks operate on any hierarchical structure, combining child representations into parent representations, recurrent neural networks operate on the linear progression of time, combining the previous time step and a hidden representation into the representation for the current time step.

In particular, RNNs can appear as nonlinear versions of finite impulse response and infinite impulse response filters and also as a nonlinear autoregressive exogenous model (NARX).[74]

Recurrent Neural Networks

Some of the most successful applications in machine learning (including deep learning) are now driven by RNNs such as Long Short-Term Memory, e.g., speech recognition, video recognition, natural language processing, image captioning, time series prediction, etc.

At this symposium, we will review the latest developments in all of these fields, and focus not only on RNNs, but also on learning machines in which RNNs interact with external memory such as neural Turing machines, memory networks, and related memory architectures such as fast weight networks and neural stack machines.

Our target audience has heard a bit about RNNs, the deepest of all neural networks, but will be happy to hear again a summary of the basics and then delve into the latest advanced topics to see and understand what has recently become possible.

The Unreasonable Effectiveness of Recurrent Neural Networks

Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?” By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs.

A few examples may make this more concrete: As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems.

You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner.

On the right, a recurrent network generates images of digits by learning to sequentially add color to a canvas (Gregor et al.): The takeaway is that even if your data is not in the form of sequences, you can still formulate and train powerful models that learn to process it sequentially.

If you’re more comfortable with math notation, we can also write the hidden state update as \( h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t ) \), where tanh is applied elementwise.
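For readers who prefer code to notation, the hidden state update above can be sketched in plain Python. The matrix sizes and values below are made up, and `matvec` and `rnn_step` are hypothetical helper names. A practical implementation would use a numerical library rather than nested lists.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(h_prev, x, W_hh, W_xh):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t), with tanh applied elementwise."""
    recurrent = matvec(W_hh, h_prev)
    driven = matvec(W_xh, x)
    return [math.tanh(a + b) for a, b in zip(recurrent, driven)]

# Made-up parameters: hidden size 2, input size 3.
W_hh = [[0.5, -0.3], [0.2, 0.1]]
W_xh = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]

h = [0.0, 0.0]
for x in [[1, 0, 0], [0, 1, 0], [0, 0, 1]]:  # a short sequence of one-hot inputs
    h = rnn_step(h, x, W_hh, W_xh)
```

The same two matrices are reused at every step; only the hidden state `h` carries information from one input to the next.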

We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs y you’d like to see in response to your input sequences x.

Here’s a diagram: For example, we see that in the first time step when the RNN saw the character “h” it assigned confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”.

Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers).
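As a small worked example, the four scores quoted above can be turned into probabilities with a softmax, and the direction of adjustment follows from the cross-entropy loss. This sketch assumes the vocabulary order h, e, l, o and that the correct next character after "h" is "e" (as in the word "hello"). The softmax and cross-entropy pairing is a common choice and is stated here as an assumption.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["h", "e", "l", "o"]
scores = [1.0, 2.2, -3.0, 4.1]  # scores from the example in the text
target = vocab.index("e")       # assumed correct next character

probs = softmax(scores)
loss = -math.log(probs[target])  # cross-entropy for this one time step

# Gradient of the loss with respect to each score: probs - one_hot(target).
# It is negative only for the target, so a gradient descent step raises the
# score of "e" and lowers the others.
score_grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
```

Backpropagation then carries these score gradients back through the network to every weight that contributed to them.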

If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea.

And if you have to act the big company too.” Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces).

In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say: “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same” looks like we’ve reached an infinite loop about startups.
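The effect of temperature can be sketched by dividing the scores by the temperature before the softmax. The scores below are reused from the earlier "h, e, l, o" example purely for illustration, and the function names and temperature values are assumptions.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(scores, temperature, rng):
    """Draw one index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [1.0, 2.2, -3.0, 4.1]
cold = softmax_with_temperature(scores, 0.05)  # nearly all mass on the top score
hot = softmax_with_temperature(scores, 5.0)    # much closer to uniform

rng = random.Random(0)
cold_samples = [sample_index(scores, 0.05, rng) for _ in range(20)]
```

At very low temperature the sampler picks the single most likely character almost every time, which is why generation can fall into a repeating loop.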

There’s also quite a lot of structured markdown that the model learns, for example sometimes it creates headings, lists, etc.: Sometimes the model snaps into a mode of generating random but valid XML: The model completely makes up the timestamp, id, and so on.

We had to step in and fix a few issues manually but then you get plausible looking math, it’s quite astonishing: Here’s another sample: As you can see above, sometimes the model tries to generate latex diagrams, but clearly it hasn’t really figured them out.

This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma.

In particular, I took all the source and header files found in the Linux repo on Github, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB).

This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code: There are too many fun parts to cover- I could probably write an entire blog post on just this part.

Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :) We saw that the results at the end of training can be impressive, but how does any of this work?

At 300 iterations we see that the model starts to get an idea about quotes and periods: The words are now also separated with spaces and the model starts to get the idea about periods at the end of a sentence.

Longer words have now been learned as well: Until at last we start to get properly spelled words, quotations, names, and so on by about iteration 2000: The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words;

In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character.

Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state).

Below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t): Of course, a lot of these conclusions are slightly hand-wavy as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation.

We can see that in addition to a large portion of cells that do not do anything interpretable, about 5% of them turn out to have learned quite interesting and interpretable algorithms: Again, what is beautiful about this is that we didn’t have to hardcode at any point that if you’re trying to predict the next character it might, for example, be useful to keep track of whether or not you are currently inside or outside of a quote.

I’ve only started working with Torch/LUA over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their gitter to get things done), but once you get the hang of things it offers a lot of flexibility and speed.

Here’s a brief sketch of a few recent developments (definitely not a complete list, and a lot of this work draws from research back to the 1990s, see related work sections): In the domain of NLP/Speech, RNNs transcribe speech to text, perform machine translation, generate handwritten text, and of course, they have been used as powerful language models (Sutskever et al.) (Graves) (Mikolov et al.) (both on the level of characters and words).

For example, we’re seeing RNNs in frame-level video classification, image captioning (also including my own work and many others), video captioning and very recently visual question answering.

My personal favorite RNNs in Computer Vision paper is Recurrent Models of Visual Attention, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)).

I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.

One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way (I’ll provide pointers in a bit that make this more concrete).

This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens.

Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly).

Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!).
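The "soft pointer" analogy can be sketched as a weighted read over every memory slot. The memory contents, the address scores, and the name `soft_read` are made up for illustration. The sketch only shows the read operation, not how a model would produce the address scores.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_read(memory, address_scores):
    """Dereference a 'soft pointer': a weighted sum over every memory row."""
    weights = softmax(address_scores)  # a distribution over all addresses
    width = len(memory[0])
    return [
        sum(w * row[k] for w, row in zip(weights, memory))
        for k in range(width)
    ]

memory = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

blended = soft_read(memory, [0.0, 0.0, 0.0])  # equal scores: plain average of all rows
sharp = soft_read(memory, [0.0, 50.0, 0.0])   # one dominant score: close to row 1 alone
```

Every read touches every row, which is the efficiency cost mentioned above, but because the result is a smooth function of the address scores the whole operation stays differentiable.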

If you’d like to play with training RNNs I hear good things about keras or passage for Theano, the code released with this post for Torch, or this gist for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass.

Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is: Yes, the post was about RNN and how well it works, so clearly this works :).

Learning to learn and compositionality with deep recurrent neural networks

Author: Nando de Freitas, Department of Computer Science, University of Oxford Abstract: Deep neural network representations play an important role in ...

Neural Network Tries to Generate English Speech (RNN/LSTM)

By popular demand, I threw my own voice into a neural network (3 times) and got it to recreate what it had learned along the way! This is 3 different recurrent ...

Neural Network Learns to Generate Voice (RNN/LSTM)

[VOLUME WARNING] This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what ...

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language ...

1. Hopfield Nets

Video from Coursera - University of Toronto - Course: Neural Networks for Machine Learning:

A friendly introduction to Recurrent Neural Networks

A friendly explanation of how computers predict and generate sequences, based on Recurrent Neural Networks. For a brush up on Neural Networks, check out ...

Convolutional Neural Networks - Ep. 8 (Deep Learning SIMPLIFIED)

Out of all the current Deep Learning applications, machine vision remains one of the most popular. Since Convolutional Neural Nets (CNN) are one of the best ...

Recurrent Neural Networks - Ep. 9 (Deep Learning SIMPLIFIED)

Our previous discussions of deep net applications were limited to static patterns, but how can a net decipher and label patterns that change with time?

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

A gentle walk through how they work and how they are useful. Some other helpful resources: RNN and LSTM slides: Luis Serrano's Friendly ..

3 neural nets battle to produce the best jazz music

Ooooh my, a vid is here Doug McKenzie's Jazz Piano: Andrej Karpathy's LSTM: ..