
Understanding LSTM Networks

In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\).

Consider what happens if we unroll the loop: This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists.
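The loop and its unrolling can be sketched as follows. This is a minimal vanilla RNN step, assuming a tanh layer for the repeating module \(A\); the sizes and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8  # illustrative sizes

# Parameters of the repeating module A (shared across all timesteps).
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The chunk A: look at the input x_t (and the previous state)
    # and output a value h_t.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Unrolling the loop: the same module is applied to each element
# of the sequence, passing its output along the chain.
xs = [rng.normal(size=input_size) for _ in range(5)]
h_t = np.zeros(hidden_size)
for x_t in xs:
    h_t = rnn_step(x_t, h_t)
```

The chain structure is explicit here: the only connection between timesteps is the hidden state `h_t` handed from one step to the next.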

In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame.

If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky.

Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back.

In theory, RNNs are absolutely capable of handling such “long-term dependencies”: a human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them.

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers.

Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through; they are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.

This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\).

A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones.
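The forget gate can be sketched as follows. The concatenation \([h_{t-1}, x_t]\) and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden_size, input_size = 8, 4  # illustrative sizes

# Forget gate parameters: one weight matrix over [h_{t-1}, x_t].
W_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # h_{t-1}
x_t = rng.normal(size=input_size)
C_prev = rng.normal(size=hidden_size)   # cell state C_{t-1}

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one number between
# 0 and 1 for each entry of the cell state C_{t-1}.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```

Because `f_t` has the same shape as the cell state, each entry of \(C_{t-1}\) gets its own keep/forget factor.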

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
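The update described above combines the two previous steps into the new cell state. A minimal sketch, again with assumed shapes and a shared concatenated input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_size, input_size = 8, 4        # illustrative sizes
z = rng.normal(size=hidden_size + input_size)  # [h_{t-1}, x_t]
C_prev = rng.normal(size=hidden_size)          # old cell state C_{t-1}

W_f, W_i, W_C = (rng.normal(scale=0.1, size=(hidden_size, z.size))
                 for _ in range(3))

f_t = sigmoid(W_f @ z)      # forget gate: what to discard from C_{t-1}
i_t = sigmoid(W_i @ z)      # input gate: which entries to update
C_tilde = np.tanh(W_C @ z)  # candidate values that could be added

# New cell state: forget the old values, scaled by f_t, then add
# the new candidates, scaled by i_t.
C_t = f_t * C_prev + i_t * C_tilde
```

The key design choice is that the update is additive and gated elementwise, which is what lets information flow along the cell state largely unchanged when the gates allow it.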

Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next.

For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
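The output step can be sketched in the same style; the output-gate weights and shapes here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
hidden_size, input_size = 8, 4                 # illustrative sizes
z = rng.normal(size=hidden_size + input_size)  # [h_{t-1}, x_t]
C_t = rng.normal(size=hidden_size)             # updated cell state

W_o = rng.normal(scale=0.1, size=(hidden_size, z.size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ z + b_o)  # output gate: which parts of the state to emit
h_t = o_t * np.tanh(C_t)      # squash the state to (-1, 1), then filter it
```

Only the gated, squashed view of the cell state is exposed as the hidden output \(h_t\); the full state \(C_t\) is carried forward internally.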

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes.
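The GRU's merged design can be sketched as follows, with assumed sizes and random weights for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(4)
hidden_size, input_size = 8, 4  # illustrative sizes
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

W_z, W_r, W_h = (rng.normal(scale=0.1,
                            size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

zx = np.concatenate([h_prev, x_t])
z_t = sigmoid(W_z @ zx)  # update gate: plays the role of forget + input
r_t = sigmoid(W_r @ zx)  # reset gate: how much old state feeds the candidate
h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))

# Single merged state: interpolate between the old state and the
# candidate, with z_t deciding the mix per entry.
h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
```

Note there is no separate cell state: the same vector `h_t` serves as both memory and output, which is the merging the text describes.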

For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs.
