# AI News, The Unreasonable Effectiveness of Recurrent Neural Networks

## The Unreasonable Effectiveness of Recurrent Neural Networks

Within a few dozen minutes of training, my first baby model (with rather arbitrarily chosen hyperparameters) started to generate very nice-looking descriptions of images that were on the edge of making sense.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?” By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs.

A few examples may make this more concrete: As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems.

You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner.

On the right, a recurrent network generates images of digits by learning to sequentially add color to a canvas (Gregor et al.): The takeaway is that even if your data is not in the form of sequences, you can still formulate and train powerful models that learn to process it sequentially.

If you’re more comfortable with math notation, we can also write the hidden state update as $$h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t )$$, where tanh is applied elementwise.
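The update above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming zero initial hidden state and small random weights; the class name `VanillaRNN`, the sizes, and the initialization scale are choices made here for the example, not details taken from the post.

```python
import numpy as np

class VanillaRNN:
    """Minimal recurrent cell: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed initialization: small Gaussian weights, zero hidden state.
        self.W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
        self.W_xh = rng.normal(0.0, 0.1, (hidden_size, input_size))
        self.h = np.zeros(hidden_size)

    def step(self, x):
        # tanh is applied elementwise, so every entry of h stays in (-1, 1).
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        return self.h

rnn = VanillaRNN(input_size=4, hidden_size=3)
x = np.array([1.0, 0.0, 0.0, 0.0])  # a one-hot input vector
h1 = rnn.step(x)                    # depends only on x (h_0 is zero)
h2 = rnn.step(x)                    # same input, but now h_1 feeds back in
```

Feeding the same input twice generally gives different hidden states, because the second call also sees the state left behind by the first.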

We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs y you’d like to see in response to your input sequences x.

Here’s a diagram: For example, we see that in the first time step when the RNN saw the character “h” it assigned confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”.

Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers).
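To make the direction of adjustment concrete, here is a small sketch using the scores quoted above for the first time step. It assumes a softmax with cross-entropy loss on the output scores and assumes the training word is "hello", so that the correct character after "h" is "e"; neither detail is stated in this excerpt, and the learning rate is an arbitrary illustrative value.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
scores = np.array([1.0, 2.2, -3.0, 4.1])  # scores quoted in the text
target = vocab.index("e")                 # assumed correct next character

# Softmax turns raw scores into probabilities (shifted for numerical stability).
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Cross-entropy loss for the correct target (assumed loss function).
loss = -np.log(probs[target])

# Gradient of the loss with respect to the scores: probs minus one-hot target.
grad_scores = probs.copy()
grad_scores[target] -= 1.0

# One gradient step directly on the scores, just to show the direction.
learning_rate = 0.1  # illustrative value
new_scores = scores - learning_rate * grad_scores
```

The step raises the score of the target and lowers every other score. In a real network the same gradient is propagated back through the weights rather than applied to the scores directly.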

If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea.

And if you have to act the big company too.” Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces).

In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say: “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same” looks like we’ve reached an infinite loop about startups.
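The effect of temperature can be sketched as follows. This is a minimal example, assuming temperature divides the output scores before the softmax (a common convention); the function name and the example scores are illustrative.

```python
import numpy as np

def sample_with_temperature(scores, temperature, rng):
    """Sample an index from softmax(scores / temperature)."""
    scaled = np.asarray(scores, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # shift for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
scores = [1.0, 2.2, -3.0, 4.1]

# Near-zero temperature: almost all mass on the highest-scoring character.
idx_cold, probs_cold = sample_with_temperature(scores, 0.01, rng)

# High temperature: the distribution flattens towards uniform.
idx_hot, probs_hot = sample_with_temperature(scores, 100.0, rng)
```

With a very low temperature the sampler picks the single most likely character at every step, which is why the generated text can fall into a repeating loop.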

There’s also quite a lot of structured markdown that the model learns, for example sometimes it creates headings, lists, etc.: Sometimes the model snaps into a mode of generating random but valid XML: The model completely makes up the timestamp, id, and so on.

We had to step in and fix a few issues manually but then you get plausible looking math, it’s quite astonishing: Here’s another sample: As you can see above, sometimes the model tries to generate latex diagrams, but clearly it hasn’t really figured them out.

This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma.

In particular, I took all the source and header files found in the Linux repo on Github, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB).

This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code: There are too many fun parts to cover- I could probably write an entire blog post on just this part.

Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :) We saw that the results at the end of training can be impressive, but how does any of this work?

At 300 iterations we see that the model starts to get an idea about quotes and periods: The words are now also separated with spaces and the model starts to get the idea about periods at the end of a sentence.

Longer words have now been learned as well: Until at last we start to get properly spelled words, quotations, names, and so on by about iteration 2000: The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words;

In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character.

Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state).

Below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t): Of course, a lot of these conclusions are slightly hand-wavy as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation.

We can see that in addition to a large portion of cells that do not do anything interpretable, about 5% of them turn out to have learned quite interesting and interpretable algorithms: Again, what is beautiful about this is that we didn’t have to hardcode at any point that if you’re trying to predict the next character it might, for example, be useful to keep track of whether or not you are currently inside or outside of a quote.

I’ve only started working with Torch/Lua over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their gitter to get things done), but once you get the hang of things it offers a lot of flexibility and speed.

Here’s a brief sketch of a few recent developments (definitely not a complete list, and a lot of this work draws from research back to the 1990s, see related work sections): In the domain of NLP/Speech, RNNs transcribe speech to text, perform machine translation, generate handwritten text, and of course, they have been used as powerful language models (Sutskever et al.) (Graves) (Mikolov et al.) (both on the level of characters and words).

For example, we’re seeing RNNs in frame-level video classification, image captioning (also including my own work and many others), video captioning and very recently visual question answering.

My personal favorite RNNs in Computer Vision paper is Recurrent Models of Visual Attention, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)).

I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.

One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way (I’ll provide pointers in a bit that make this more concrete).

This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens.

Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly).

Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!).
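The "soft pointer" analogy can be sketched in NumPy. This is a minimal illustration, assuming the address distribution comes from a softmax over per-slot scores; the function name `soft_read` and the toy memory contents are made up for the example.

```python
import numpy as np

def soft_read(memory, address_scores):
    """Dereference a 'soft pointer': a weighted sum over all memory rows."""
    scores = np.asarray(address_scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # distribution over all addresses
    return weights @ memory, weights  # touches every row, hence the cost

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])

# Uniform attention returns the average of all rows.
uniform_value, uniform_weights = soft_read(memory, [0.0, 0.0, 0.0])

# Sharply peaked attention behaves almost like an ordinary pointer to row 0.
sharp_value, sharp_weights = soft_read(memory, [50.0, 0.0, 0.0])
```

Every read costs time proportional to the size of the memory, which is the efficiency price paid for keeping the operation differentiable.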

If you’d like to play with training RNNs I hear good things about keras or passage for Theano, the code released with this post for Torch, or this gist for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass.

Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is: Yes, the post was about RNN and how well it works, so clearly this works :).

## A Word of Caution on Scheduled Sampling for Training RNNs

Here is the link to the original paper: Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer (2015): Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.

Google's team developed scheduled sampling as an alternative training procedure to fit RNNs, and they used it in their competition-winning method for image captioning.

While I can't argue with the empirical results (so I won't), I was a bit skeptical about the technique at a fundamental level, so I decided to do a bit of math that resulted in this blog post.

A scoring rule is called strictly proper if, for any $P$, the following holds: $$\underset{Q}{\operatorname{argmin}} \mathbb{E}_{x\sim P}S(x,Q) = P$$ In other words, if you repeatedly sample observations from some true underlying distribution $P$, then the model $Q$ which minimises the expected score is $P$.

Unsupervised learning is all about modelling the probability distribution of data, so it's essential that we have loss functions that can measure the discrepancy between our model $Q$, and the true data distribution $P$ in a consistent way.

One of the most frequently used strictly proper scoring rules is the logarithmic score: $$S(x,Q) = - \log Q(x)$$ This quantity is also known as the negative log-likelihood.
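A quick numerical check of this property for the logarithmic score, as a sketch: assume the data is a biased coin with $P(x=1) = 0.3$ and the model $Q$ is another coin with parameter $q$. The value 0.3 and the grid of candidates are arbitrary illustrative choices.

```python
import numpy as np

p_true = 0.3                                 # assumed true P(x = 1)
candidates = np.linspace(0.01, 0.99, 99)     # candidate model parameters q

# Expected logarithmic score E_{x~P}[-log Q(x)] for each candidate q.
expected_score = -(p_true * np.log(candidates)
                   + (1 - p_true) * np.log(1 - candidates))

# The minimiser of the expected score should be the true parameter.
best_q = candidates[np.argmin(expected_score)]
```

On this grid the expected score is smallest at $q = 0.3$, that is, when the model matches the true distribution.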

To recap, any learning method that corresponds to minimising a strictly proper scoring rule is fine. Everything else can go horribly wrong: even if we feed it infinite data, it might just learn the wrong thing.

After training, you use the RNN to generate sample sentences in a recursive fashion: assuming you've already generated $n$ characters, you feed that prefix into the RNN, and ask it to predict the $n+1$st character.

More specifically, for each character in the training sentences, we flip a coin to decide whether we feed the character from the real training sentence, or whether to feed the model's own prediction as to what that character would have been.
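The coin flip described above can be sketched like this. It is only an illustration of the input-selection step, not the authors' implementation: `predict_next` is a hypothetical callable standing in for the RNN's own prediction, and `eps` is assumed to be the probability of feeding the true character.

```python
import numpy as np

def scheduled_sampling_inputs(true_chars, predict_next, eps, rng):
    """Choose, per position, the true character or the model's own guess.

    eps is the probability of feeding the true character; predict_next is a
    hypothetical stand-in for the model, called with the inputs fed so far.
    """
    fed = []
    for true_char in true_chars:
        if rng.random() < eps:
            fed.append(true_char)          # use the real training data
        else:
            fed.append(predict_next(fed))  # use the model's prediction
    return fed

def toy_predict(prefix):
    # Placeholder model: always predicts "?" regardless of the prefix.
    return "?"

rng = np.random.default_rng(0)
all_true = scheduled_sampling_inputs("hello", toy_predict, 1.0, rng)
all_model = scheduled_sampling_inputs("hello", toy_predict, 0.0, rng)
mixed = scheduled_sampling_inputs("hello", toy_predict, 0.5, rng)
```

With `eps = 1.0` this reduces to ordinary training on the true sequence; with `eps = 0.0` the model only ever sees its own predictions.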

The scoring rule for scheduled sampling looks something like this: $$S(Q_{x_1,x_2},(x_1,x_2)) = - (1 - \epsilon) [ \mathbb{E}_{z \sim Q_{x_1}} \log Q_{x_2 \vert x_1}(x_2 \vert z) + \log Q_{x_1}(x_1)] - \epsilon \log Q_{x_2 , x_1}(x_1,x_2),$$ where $\epsilon$ is the probability with which the true $x_1$ is used.

After some math, one can show that scheduled sampling with a fixed $\epsilon$ minimises the following divergence between the true $P$ and the model $Q$: $$D_{SS}[P\|Q] = KL[P_{x_1}\|Q_{x_1}] + (1-\epsilon) \mathbb{E}_{z\sim Q_{x_1}} KL[P_{x_2}\|Q_{x_2\vert x_1=z}] + \epsilon KL[P_{x_2\vert x_1}\|Q_{x_2\vert x_1}]$$ Now, if $\epsilon=1$, we recover the Kullback-Leibler divergence between the joint $P_{x_1,x_2}$ and $Q_{x_1,x_2}$, which is what we expect as it corresponds to maximum likelihood estimation.

However, as $\epsilon$ is annealed to $0$, the objective function is somewhat strange, whereby the conditional distribution $Q_{x_2\vert x_1}$ is pushed to model the marginal distribution $P_{x_2}$, instead of $P_{x_2\vert x_1}$ as one would expect.
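The behaviour at the two extremes can be checked numerically on a tiny example. This sketch evaluates $D_{SS}$ for two binary variables; the joint table `P` is an arbitrary illustrative choice, and the conditional KL term is assumed to be averaged over $x_1 \sim P_{x_1}$.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def d_ss(P, Q, eps):
    """Scheduled-sampling divergence for joint tables indexed [x1, x2]."""
    P_x1, Q_x1, P_x2 = P.sum(axis=1), Q.sum(axis=1), P.sum(axis=0)
    P_cond = P / P_x1[:, None]   # P(x2 | x1)
    Q_cond = Q / Q_x1[:, None]   # Q(x2 | x1)
    marginal_term = sum(Q_x1[z] * kl(P_x2, Q_cond[z]) for z in range(len(Q_x1)))
    # Assumption: the conditional KL is averaged over x1 drawn from P.
    conditional_term = sum(P_x1[a] * kl(P_cond[a], Q_cond[a]) for a in range(len(P_x1)))
    return kl(P_x1, Q_x1) + (1 - eps) * marginal_term + eps * conditional_term

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])                        # x1 and x2 are correlated
Q_true = P.copy()                                 # the correct model
Q_indep = np.outer(P.sum(axis=1), P.sum(axis=0))  # ignores x1 when predicting x2

at_eps_1 = (d_ss(P, Q_true, 1.0), d_ss(P, Q_indep, 1.0))
at_eps_0 = (d_ss(P, Q_true, 0.0), d_ss(P, Q_indep, 0.0))
```

At $\epsilon = 1$ the true model has zero divergence and the independent model is penalised; at $\epsilon = 0$ the ranking flips, and the model that ignores $x_1$ becomes optimal.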

The optimal model that minimises this objective would completely ignore all the characters in the sentence so far, but keep a simple linear counter that indexes where it is within the sentence.

I believe the reason why this trivial behaviour was not observed in the paper is that the authors did not run the optimisation until convergence, and did not implement the full gradient of the objective function, as they discuss in the paper.

The main reason for the observed problem is that the log-likelihood is a local scoring rule. The local property of scoring rules means that at training time we only ever evaluate the model $Q$ on actually observed datapoints.

## Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks.

It’s a multi-part series in which I’m planning to cover the following: As part of the tutorial we will implement a recurrent neural network based language model.

The applications of language models are two-fold: First, it allows us to score arbitrary sentences based on how likely they are to occur in the real world.

If you want to predict the next word in a sentence you better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations.

By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The formulas that govern the computation happening in a RNN are as follows: There are a few things to note here: RNNs have shown great success in many NLP tasks.

A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities.

In Language Modeling our input is typically a sequence of words (encoded as one-hot vectors for example), and our output is the sequence of predicted words.
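As a small illustration of the one-hot encoding mentioned above, here is a sketch in NumPy. The function name, the toy vocabulary, and the example sentence are all made up for the example.

```python
import numpy as np

def one_hot_encode(words, vocabulary):
    """Encode a sequence of words as rows of one-hot vectors."""
    index = {word: i for i, word in enumerate(vocabulary)}
    encoded = np.zeros((len(words), len(vocabulary)))
    for t, word in enumerate(words):
        encoded[t, index[word]] = 1.0   # exactly one active entry per step
    return encoded

vocabulary = ["the", "cat", "sat", "down"]   # toy vocabulary (assumed)
sentence = ["the", "cat", "sat"]
inputs = one_hot_encode(sentence, vocabulary)

# For language modeling, the target at each step is the next word.
input_steps, target_steps = inputs[:-1], inputs[1:]
```

Each row is the input for one time step, and shifting the sequence by one position yields the matching targets.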

A key difference is that our output only starts after we have seen the complete input, because the first word of our translated sentences may require information captured from the complete input sequence.

Research papers about Machine Translation: Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.

Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also the previous time steps.

The memory units in LSTMs are called cells, and you can think of them as black boxes that take as input the previous state and the current input.

I hope you’ve gotten a basic understanding of what RNNs are and what they can do. In the next post we’ll implement a first version of our language model RNN using Python and Theano. Please leave questions in the comments!

## A ten-minute introduction to sequence-to-sequence learning in Keras

This can be used for machine translation or for free-form question answering (generating a natural language answer given a natural language question) -- in

This is the case in this example script that shows how to teach an RNN to learn to add numbers, encoded as character strings:

In the general case, information about the entire input sequence is necessary in order to start generating the target sequence.

machine translation) and the entire input sequence is required in order to start predicting the target.

This requires a more advanced setup, which is what people commonly refer to when mentioning 'sequence to sequence models' with no further context.

We will implement a character-level sequence-to-sequence model, processing the input character-by-character and generating the output character-by-character.

It leverages three key features of Keras RNNs: We train our model in two lines, while monitoring the loss on a held-out set of 20% of the samples.

To decode a test sentence, we will repeatedly: Here's our inference setup: We use it to implement the inference loop described above: We get some nice results -- unsurprising since we are decoding samples taken from the training set.

Here's how: In some niche cases you may not be able to use teacher forcing, because you don't have access to the full target sequences, e.g.

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other sequence tasks, and bidirectional and deep...

2. Training RNNs with Back Propagation

Video from Coursera - University of Toronto - Course: Neural Networks for Machine Learning:

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

A gentle walk through how they work and how they are useful. Some other helpful resources: RNN and LSTM slides: Luis Serrano's Friendly Intro to RNNs:

Lecture 10: Neural Machine Translation and Models with Attention

Lecture 10 introduces translation, machine translation, and neural machine translation. Google's new NMT is highlighted followed by sequence models with attention as well as sequence model...

Google's DeepMind AI Just Taught Itself To Walk

Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as...

Deep Learning Stockholm #3: Sequential data modeling with RNNs & Creative AI

Deep Learning Stockholm Meetup #3: Sequential data modeling with RNNs & Creative AI (meetup page: Its been a..

Computer evolves to generate baroque music!

I put the word "evolve" in there because you guys like "evolution" videos, but this computer is actually learning with gradient descent! All music in this video is either by Bach, Mozart,...

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language modeling and image captioning, and how...
