AI News, Sequence prediction using recurrent neural networks (LSTM) with TensorFlow

Sequence prediction using recurrent neural networks (LSTM) with TensorFlow

This post tries to demonstrate how to approximate a sequence of vectors using a recurrent neural network, in particular the LSTM architecture. The complete code used for this post could be found here.

Most of the examples I found on the internet apply the LSTM architecture to natural language processing problems, and I couldn’t find an example where this architecture is used to predict continuous values.

Traditional neural network architectures can’t do this; recurrent neural networks were designed to address this issue, as they can store previous information and use it to predict future events.

N.B. I am not completely sure if this is the right way to train an LSTM on regression problems; I am still experimenting with the RNN sequence-to-sequence model, and I will update this post or write a new one that uses it.
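To make the setup concrete, here is a minimal sketch (not the code from the repository linked above) of fitting an LSTM to continuous values, assuming the Keras API that ships with TensorFlow and a toy noisy sine wave as the data:

    import numpy as np
    import tensorflow as tf

    # Toy continuous signal: a noisy sine wave.
    t = np.linspace(0, 100, 5000)
    series = np.sin(t) + 0.1 * np.random.randn(len(t))

    # Build (window of past values -> next value) training pairs.
    window = 20
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    X = X[..., np.newaxis]  # shape: (samples, time steps, features)

    # One LSTM layer followed by a linear unit, trained with MSE for regression.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(window, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, batch_size=64, verbose=0)

    # Predict the value that follows the last observed window.
    next_value = model.predict(X[-1:])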

Understanding LSTM Networks

In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\).

Consider what happens if we unroll the loop: This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists.

In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame.

If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky.

Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, plain RNNs don’t seem to be able to learn them.

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

LSTMs were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers.

Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.

This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\).
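In symbols, following the standard LSTM formulation (with \(\sigma\) denoting the sigmoid):

\[ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \]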

A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
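This step uses an input gate \(i_t\) to decide which values to update and a \(\tanh\) layer to propose candidate values \(\tilde{C}_t\):

\[ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \]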

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
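Written out, the new cell state is the old state scaled by the forget gate, plus the scaled candidate values:

\[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \]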

Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
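In symbols, with an output gate \(o_t\):

\[ o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t) \]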

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next.

For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit (GRU). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes.
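In equations, the GRU reads (in the usual formulation, with update gate \(z_t\) and reset gate \(r_t\)):

\[ z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \qquad r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \]
\[ \tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right), \qquad h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \]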

For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs.

The Unreasonable Effectiveness of Recurrent Neural Networks

Within a few dozen minutes of training, my first baby model (with rather arbitrarily chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?” By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs.

A few examples may make this more concrete: As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems.

You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner.

On the right, a recurrent network generates images of digits by learning to sequentially add color to a canvas (Gregor et al.): The takeaway is that even if your data is not in the form of sequences, you can still formulate and train powerful models that learn to process it sequentially.

If you’re more comfortable with math notation, we can also write the hidden state update as \( h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t ) \), where tanh is applied elementwise.
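A minimal numpy sketch of that recurrence (along the lines of the code in the original post, with an output projection \(W_{hy}\) reading a prediction out of the hidden state):

    import numpy as np

    class RNN:
        def __init__(self, hidden_size, input_size, output_size):
            # Small random matrices; training adjusts these.
            self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
            self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
            self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
            self.h = np.zeros((hidden_size, 1))

        def step(self, x):
            # h_t = tanh(W_hh h_{t-1} + W_xh x_t), applied elementwise.
            self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
            # Read a prediction out of the hidden state.
            return self.W_hy @ self.h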

We initialize the matrices of the RNN with random numbers, and the bulk of the work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference for what kinds of outputs y you’d like to see in response to your input sequences x.

Here’s a diagram: For example, we see that in the first time step when the RNN saw the character “h” it assigned confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”.

Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers).
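As a toy illustration of that gradient signal for the first time step above (assuming the usual softmax cross-entropy loss, with “e” taken as the correct next character):

    import numpy as np

    scores = np.array([1.0, 2.2, -3.0, 4.1])  # scores for "h", "e", "l", "o"
    target = 1                                 # index of the correct letter "e"

    probs = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities
    dscores = probs.copy()
    dscores[target] -= 1.0  # gradient of cross-entropy w.r.t. the scores

    # Negative entries say "push this score up", positive entries say "push it
    # down"; backpropagation carries this signal through the whole network.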

If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea.

And if you have to act the big company too.” Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces).

In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say: “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same” looks like we’ve reached an infinite loop about startups.
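For reference, here is a small hypothetical sampling helper (not the released Torch code) showing what the temperature knob does:

    import numpy as np

    def sample_char(scores, temperature=1.0):
        # Dividing the scores by the temperature sharpens the distribution as it
        # approaches zero, so sampling becomes nearly greedy and the text loops.
        scaled = np.asarray(scores, dtype=np.float64) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)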

There’s also quite a lot of structured markdown that the model learns, for example sometimes it creates headings, lists, etc.: Sometimes the model snaps into a mode of generating random but valid XML: The model completely makes up the timestamp, id, and so on.

We had to step in and fix a few issues manually but then you get plausible looking math, it’s quite astonishing: Here’s another sample: As you can see above, sometimes the model tries to generate latex diagrams, but clearly it hasn’t really figured them out.

This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma.

In particular, I took all the source and header files found in the Linux repo on Github, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB).

This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code: There are too many fun parts to cover; I could probably write an entire blog post on just this part.

Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :) We saw that the results at the end of training can be impressive, but how does any of this work?

At 300 iterations we see that the model starts to get an idea about quotes and periods: The words are now also separated with spaces and the model starts to get the idea about periods at the end of a sentence.

Longer words have now been learned as well: Until at last we start to get properly spelled words, quotations, names, and so on by about iteration 2000: The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words.

In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character.

Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state).

Below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t): Of course, a lot of these conclusions are slightly hand-wavy as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation.

We can see that in addition to a large portion of cells that do not do anything interpretable, about 5% of them turn out to have learned quite interesting and interpretable algorithms: Again, what is beautiful about this is that we didn’t have to hardcode at any point that if you’re trying to predict the next character it might, for example, be useful to keep track of whether or not you are currently inside or outside of a quote.

I’ve only started working with Torch/Lua over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their gitter to get things done), but once you get the hang of things it offers a lot of flexibility and speed.

Here’s a brief sketch of a few recent developments (definitely not complete list, and a lot of this work draws from research back to 1990s, see related work sections): In the domain of NLP/Speech, RNNs transcribe speech to text, perform machine translation, generate handwritten text, and of course, they have been used as powerful language models (Sutskever et al.) (Graves) (Mikolov et al.) (both on the level of characters and words).

For example, we’re seeing RNNs in frame-level video classification, image captioning (also including my own work and many others), video captioning and very recently visual question answering.

My personal favorite RNNs in Computer Vision paper is Recurrent Models of Visual Attention, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)).

I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.

One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way (I’ll provide pointers in a bit that make this more concrete).

This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens.

Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly).

Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!).
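A tiny numpy sketch of that “soft dereference” (the names and sizes here are purely illustrative):

    import numpy as np

    memory = np.random.randn(128, 32)  # 128 slots, each holding a 32-dim vector
    scores = np.random.randn(128)      # how strongly we "point" at each slot

    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # attention distribution over all addresses

    # Reading returns a weighted sum of every slot's contents: fully
    # differentiable, but it touches all of memory on every read.
    read_vector = weights @ memory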

If you’d like to play with training RNNs I hear good things about keras or passage for Theano, the code released with this post for Torch, or this gist for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass.

Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is: Yes, the post was about RNN and how well it works, so clearly this works :).

Multi-state LSTMs for categorical features

Every time a student’s label is fed to the Multi-state LSTM cell, it starts by looking up the corresponding state and previous score.

This modified version of the basic LSTM cell works similarly but has an additional advantage: it can now directly take in a sequence of student labels as inputs. For the sake of our previous example we considered the “user_id” categorical feature, but this can actually be applied to all of our categorical features.

From here, the complete network we used is pretty straightforward: all in all, this architecture tries to learn separate histories for each modality of each categorical feature, then uses the LSTM outputs to predict a success probability for a given student on a given exercise.
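I haven’t seen the authors’ implementation, but a hypothetical sketch of the per-label state lookup for a single categorical feature, using tf.keras.layers.LSTMCell and a plain dict of states, could look like this:

    import tensorflow as tf

    cell = tf.keras.layers.LSTMCell(16)
    head = tf.keras.layers.Dense(1, activation="sigmoid")  # success probability
    states = {}  # categorical label (e.g. a user_id) -> [h, c] state

    def step(label, features):
        # Look up this label's state, or start from zeros for a new label.
        state = states.get(label, [tf.zeros([1, 16]), tf.zeros([1, 16])])
        x = tf.constant([features], dtype=tf.float32)
        output, new_state = cell(x, state)
        states[label] = new_state  # write the updated state back for next time
        return head(output)

    # e.g. step("user_42", [0.0, 1.0, 0.3]) -> predicted success probability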

But the predicted distributions are very different: in fact, the LSTM-based model seems to separate 3 to 4 different populations. Conversely, the baseline only predicts a skewed normal distribution around the overall average success rate, with no particular discrimination.

In fact, judging by the MSE alone, the two models have similar performance. But as we discussed in a previous post, when real randomness is associated with the targets (here, predicting human behavior), we get a better assessment of the performance of a classifier by looking at its reliability measure (REL).

Stacked Long Short-Term Memory Networks

The original LSTM model is comprised of a single hidden LSTM layer followed by a standard feedforward output layer.

Stacking LSTM hidden layers makes the model deeper, more accurately earning the description of a deep learning technique.

It is the depth of neural networks that is generally attributed to the success of the approach on a wide range of challenging prediction problems.

[the success of deep neural networks] is commonly attributed to the hierarchy that is introduced due to the several layers.

In this sense, the DNN can be seen as a processing pipeline, in which each layer solves a part of the task before passing it on to the next, until finally the last layer provides the output.

The additional hidden layers are understood to recombine the learned representation from prior layers and create new representations at high levels of abstraction.

Given that LSTMs operate on sequence data, it means that the addition of layers adds levels of abstraction of input observations over time.

In Speech Recognition With Deep Recurrent Neural Networks (2013), the authors found that the depth of the network was more important for model skill than the number of memory cells in a given layer.

When an LSTM processes one input sequence of time steps, each memory cell will output a single value for the whole sequence, so the layer returns a 2D array rather than a 3D sequence.

We can continue to add hidden LSTM layers as long as the prior LSTM layer provides a 3D output (the full sequence) as input for the subsequent layer. Below is an example of defining a two hidden layer Stacked LSTM.
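A minimal sketch of such a model, assuming the Keras Sequential API (the layer sizes and the 10-time-step, 1-feature input shape are arbitrary placeholders):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    model = Sequential()
    # The first hidden LSTM layer must return the full 3D sequence so the next
    # LSTM layer has a sequence of time steps to read.
    model.add(LSTM(50, return_sequences=True, input_shape=(10, 1)))
    model.add(LSTM(50))   # second hidden LSTM layer returns a 2D output
    model.add(Dense(1))
    model.compile(optimizer="adam", loss="mse")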

LSTM Part 1

Using Keras to implement LSTMs. LSTMs are a particular kind of RNN that performs well compared to vanilla RNNs.

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

A gentle walk through how they work and how they are useful. Some other helpful resources: RNN and LSTM slides: Luis Serrano's Friendly ..

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language ...

How to Make a Text Summarizer - Intro to Deep Learning #10

I'll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We'll go over word embeddings, ...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other ...

How to Make an Amazing Tensorflow Chatbot Easily

We'll go over how chatbots have evolved over the years and how Deep Learning has made them way better. Then we'll build our own chatbot using the ...

How to Make a Language Translator - Intro to Deep Learning #11

Let's build our own language translator using Tensorflow! We'll go over several translation methods and talk about how Google Translate is able to achieve state ...

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far followed by machine translation and fancy RNN models tackling MT. Key phrases: ...

Recurrent Neural Networks | Lecture 11

Earlier videos: References and further reading: Deep Learning by Ian ..

Predicting Sales during Hurricanes for a Home Improvement Retailer via TensorFlow (Cloud Next '18)

When a hurricane is bearing down on a major metropolitan area, accurately predicting the extreme spikes in demand for home improvement categories is ...