
Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano

To understand what this means, let's look at how an LSTM calculates a hidden state s_t (I'm using ∘ to mean elementwise multiplication):

i = σ(x_t U^i + s_{t-1} W^i)
f = σ(x_t U^f + s_{t-1} W^f)
o = σ(x_t U^o + s_{t-1} W^o)
g = tanh(x_t U^g + s_{t-1} W^g)
c_t = c_{t-1} ∘ f + g ∘ i
s_t = tanh(c_t) ∘ o

With that in mind, let's try to get an intuition for how an LSTM unit computes the hidden state. Chris Olah has an excellent post that goes into detail on this, and to avoid duplicating his effort I will only give a brief explanation here.

If you fix the input gate to all 1's, the forget gate to all 0's (you always forget the previous memory) and the output gate to all 1's (you expose the whole memory), you almost get a standard RNN.
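Concretely, substituting those values into the equations above:

With i = 1, f = 0 and o = 1:

c_t = c_{t-1} ∘ 0 + g ∘ 1 = g
s_t = tanh(c_t) ∘ 1 = tanh(tanh(x_t U^g + s_{t-1} W^g))

Up to the extra outer tanh, this is the standard RNN update s_t = tanh(x_t U + s_{t-1} W), hence the "almost".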

There are several variations on the basic LSTM architecture. A common one is creating peephole connections that allow the gates to depend not only on the previous hidden state s_{t-1}, but also on the previous internal state c_{t-1}, adding an additional term to the gate equations.
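To make the computation above concrete, here is a minimal NumPy sketch of a single LSTM step (no peepholes, biases omitted); the dictionary-based weight layout is illustrative, not the tutorial's actual Theano code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, s_prev, c_prev, U, W):
        # U: input-to-hidden weights, W: hidden-to-hidden weights (dicts keyed by gate)
        i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])  # input gate
        f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])  # forget gate
        o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])  # output gate
        g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])  # candidate memory
        c_t = c_prev * f + g * i                     # new internal memory
        s_t = np.tanh(c_t) * o                       # new hidden state
        return s_t, c_t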

The basic idea of using a gating mechanism to learn long-term dependencies is the same as in an LSTM, but there are a few key differences: a GRU has two gates (a reset gate and an update gate) instead of three, it has no internal memory (c_t) distinct from the exposed hidden state, and it has no output gate.

Now that you've seen two models to combat the vanishing gradient problem, you may be wondering: which one should you use?

In many tasks both architectures yield comparable performance and tuning hyperparameters like layer size is probably more important than picking the ideal architecture.

If you are somehow forced to calculate the gradients yourself, you probably want to modularize the different units and have your own version of auto-differentiation using the chain rule.

It turns out that plain SGD isn't such a great idea: if you set your learning rate low enough, SGD is guaranteed to make progress towards a good solution, but in practice that would take a very long time. There exist a number of commonly used variations on SGD, including the (Nesterov) Momentum Method, AdaGrad, AdaDelta and rmsprop.

The basic idea behind rmsprop is to adjust the learning rate per parameter according to a (smoothed) sum of the previous squared gradients.

Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their squared gradients is larger), and rare features get a larger learning rate.

For each parameter we keep a cache variable, and during gradient descent we update the parameter and the cache as follows (example for W):

cache_W = decay * cache_W + (1 - decay) * dW^2
W = W - learning_rate * dW / sqrt(cache_W + 1e-6)

The decay is typically set to 0.9 or 0.95, and the 1e-6 term is added to avoid division by 0.
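A minimal NumPy sketch of that update (the function and parameter names are illustrative; the tutorial's Theano implementation expresses the same idea with shared variables):

    import numpy as np

    def rmsprop_update(W, dW, cache, learning_rate=0.001, decay=0.9, eps=1e-6):
        # Keep a smoothed, per-parameter sum of squared gradients
        cache = decay * cache + (1 - decay) * dW ** 2
        # Parameters with a large gradient history take smaller steps
        W = W - learning_rate * dW / np.sqrt(cache + eps)
        return W, cache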

I didn't use pre-trained word vectors in my experiments, but adding an embedding layer (the matrix E in our code) makes it easy to plug them in.

In that matrix, the i-th column vector corresponds to the i-th word in our vocabulary. By updating E we are learning word vectors ourselves, but they are very specific to our task (and data set) and not as general as those you can download, which are trained on millions or billions of documents.
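A sketch of what that lookup amounts to, assuming (as in this post) an embedding matrix whose columns are word vectors; the sizes are arbitrary and the pretrained_vectors array is hypothetical:

    import numpy as np

    embedding_dim, vocab_size = 48, 8000
    E = np.random.uniform(-0.1, 0.1, (embedding_dim, vocab_size))  # learned from scratch

    sentence = np.array([42, 7, 105])   # a sentence as word indices
    embedded = E[:, sentence]           # shape: (embedding_dim, sentence length)

    # To plug in downloaded vectors instead, initialize E from them:
    # E = pretrained_vectors  # hypothetical (embedding_dim, vocab_size) array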

You'll likely see diminishing returns after 2-3 layers, and unless you have a huge amount of data (which we don't), more layers are unlikely to make a big difference and may lead to overfitting.

Instead of learning from one sentence at a time, you want to group sentences of the same length (or even pad all sentences to have the same length) and then perform large matrix multiplications and sum up gradients for the whole batch. That’s because such large matrix multiplications are efficiently handled by a GPU.
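A sketch of the padding idea, assuming sentences are lists of word indices and index 0 is reserved for padding:

    import numpy as np

    def pad_batch(sentences, pad_id=0):
        # Pad every sentence to the length of the longest one so the whole
        # batch becomes a single (batch_size x max_len) matrix.
        max_len = max(len(s) for s in sentences)
        batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
        mask = np.zeros((len(sentences), max_len), dtype=np.float32)
        for row, s in enumerate(sentences):
            batch[row, :len(s)] = s
            mask[row, :len(s)] = 1.0  # use the mask to ignore padding in the loss
        return batch, mask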


Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data emanating from sensors, stock markets and government agencies.

Since recurrent networks possess a certain type of memory, and memory is also part of the human condition, we'll make repeated analogies to memory in the brain. To understand recurrent nets, first you have to understand the basics of feedforward nets.

Here’s a diagram of an early, simple recurrent net proposed by Elman, where the BTSXPE at the bottom of the drawing represents the input example in the current moment, and CONTEXT UNIT represents the output of the previous moment.

It is often said that recurrent networks have memory. Adding memory to neural networks has a purpose: there is information in the sequence itself, and recurrent nets use it to perform tasks that feedforward networks can't.

That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example.

It is finding correlations between events separated by many moments, and these correlations are called “long-term dependencies”, because an event downstream in time depends upon, and is a function of, one or more events that came before.

Just as human memory circulates invisibly within a body, affecting our behavior without revealing its full shape, information circulates in the hidden states of recurrent nets.

The hidden state at time step t is h_t. It is a function of the input at the same time step, x_t, modified by a weight matrix W (like the one we used for feedforward nets), added to the hidden state of the previous time step, h_{t-1}, multiplied by its own hidden-state-to-hidden-state matrix U, otherwise known as a transition matrix and similar to a Markov chain:

h_t = φ(W x_t + U h_{t-1})

The sum of the weighted input and hidden state is squashed by the function φ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation.

Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded h_{t-1} for as long as memory can persist.
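In code, that recurrence is only a few lines; here is a minimal NumPy sketch (weight shapes are illustrative, and the bias is omitted to match the formula above):

    import numpy as np

    def rnn_step(x_t, h_prev, W, U):
        # h_t = phi(W x_t + U h_{t-1}), with tanh as the squashing function phi
        return np.tanh(W @ x_t + U @ h_prev)

    def rnn_forward(xs, h0, W, U):
        # Each hidden state feeds the next step, carrying traces of all earlier inputs
        h, states = h0, []
        for x_t in xs:
            h = rnn_step(x_t, h, W, U)
            states.append(h)
        return states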

Given a series of letters, a recurrent network will use the first character to help determine its perception of the second character, such that an initial q might lead it to infer that the next letter will be u, while an initial t might lead it to infer that the next letter will be h.

Since recurrent nets span time, they are probably best illustrated with animation (the first vertical line of nodes to appear can be thought of as a feedforward network, which becomes recurrent as it unfurls over time).

In the diagram above, each x is an input example, w is the weights that filter inputs, a is the activation of the hidden layer (a combination of weighted input and the previous hidden state), and b is the output of the hidden layer after it has been transformed, or squashed, using a rectified linear or sigmoid unit.

Backpropagation in feedforward networks moves backward from the final error through the outputs, weights and inputs of each hidden layer, assigning those weights responsibility for a portion of the error by calculating their partial derivatives – ∂E/∂w, or the relationship between their rates of change.

Everyone who has studied compound interest knows that any quantity multiplied frequently by an amount slightly greater than one can become immeasurably large (indeed, that simple mathematical truth underpins network effects and inevitable social inequalities).
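The same arithmetic is at work in backpropagation through time, where error signals are multiplied by similar factors at every step; a quick sketch of how fast repeated multiplication shrinks or blows up a value:

    # Repeatedly multiplying by a factor slightly below or above 1
    print(0.9 ** 100)   # ~2.7e-05: the signal all but vanishes
    print(1.1 ** 100)   # ~13780.6: the signal explodes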

By maintaining a more constant error, LSTMs allow recurrent nets to continue to learn over many time steps (over 1,000), thereby opening a channel to link causes and effects remotely.

(Religious thinkers have tackled this same problem with ideas of karma or divine reward, theorizing invisible and distant consequences to our actions.) LSTMs contain information outside the normal flow of the recurrent network in a gated cell.

Those gates act on the signals they receive, and similar to the neural network’s nodes, they block or pass on information based on its strength and import, which they filter with their own sets of weights.

That is, the cells learn when to allow data to enter, leave or be deleted through the iterative process of making guesses, backpropagating error, and adjusting weights via gradient descent.

The black dots are the gates themselves, which determine respectively whether to let new input in, erase the present cell state, and/or let that state impact the network’s output at the present time step.

The forget gate is represented as a linear identity function, because if the gate is open, the current state of the memory cell is simply multiplied by one, to propagate forward one more time step.

If you’re analyzing a text corpus and come to the end of a document, for example, you may have no reason to believe that the next document has any relationship to it whatsoever, and therefore the memory cell should be set to zero before the net ingests the first element of the next document.

You may also wonder what the precise value is of input gates that protect a memory cell from new data coming in, and output gates that prevent it from affecting certain outputs of the RNN.

Here's what the LSTM configuration looks like:

Here are a few ideas to keep in mind when manually optimizing hyperparameters for RNNs:

While recurrent networks may seem like a far cry from general artificial intelligence, it's our belief that intelligence, in fact, is probably dumber than we thought.

Recurrent networks, which also go by the name of dynamic (translation: “changing”) neural networks, are distinguished from feedforward nets not so much by having memory as by giving particular weight to events that occur in a series.

On the other hand, we also learn as children to decipher the flow of sound called language, and the meanings we extract from sounds such as “toe” or “roe” or “z” are always highly dependent on the sounds preceding (and following) them.

Understanding GRU networks

In this article, I will try to give a fairly simple and understandable explanation of one really fascinating type of neural network.

To explain the mathematics behind that process, we will examine a single unit from the following recurrent neural network:

Here is a more detailed version of that single GRU:

First, let's introduce the notation. If you are not familiar with the terminology, I recommend watching these tutorials about the "sigmoid" and "tanh" functions and the "Hadamard product" operation.

We start by calculating the update gate z_t for time step t using the formula:

z_t = σ(W(z) x_t + U(z) h_(t-1))

When x_t is plugged into the network unit, it is multiplied by its own weight W(z); the same goes for h_(t-1), which is multiplied by its own weight U(z).

The schema below shows where the reset gate is. As before, we plug in h_(t-1) (blue line) and x_t (purple line), multiply them by their corresponding weights, sum the results, and apply the sigmoid function:

r_t = σ(W(r) x_t + U(r) h_(t-1))

The new memory content h'_t uses the reset gate to store the relevant information from the past. It is calculated as follows:

h'_t = tanh(W x_t + r_t ⊙ U h_(t-1))

You can clearly see the steps here: we do an element-wise multiplication of h_(t-1) (blue line) and r_t (orange line), and then sum the result (pink line) with the input x_t (purple line).

Finally, the network computes h_t, the vector which holds information for the current unit:

h_t = z_t ⊙ h_(t-1) + (1 - z_t) ⊙ h'_t

Since z_t will be close to 1 at this time step, 1 - z_t will be close to 0, which ignores a big portion of the current content (in this case the last part of the review, which explains the book plot) that is irrelevant for our prediction.

Here is an illustration which emphasises the above equation. Following through, you can see how z_t (green line) is used to calculate 1 - z_t which, combined with h'_t (bright green line), produces a result in the dark red line.
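Putting the four equations together, a minimal NumPy sketch of one GRU step (the weight names follow the notation above; biases are omitted):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
        z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
        r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
        h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))   # current memory content h'_t
        h_t = z_t * h_prev + (1 - z_t) * h_cand          # final memory at step t
        return h_t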

GRUs vs. LSTMs

Notes on Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Overall impression: The authors seem to recognize that their study does not produce any novel ideas or breakthroughs (that's okay!).

However, they were mostly impractical due to the vanishing and exploding gradient problems during training, until the introduction of the long short-term memory (LSTM) recurrent unit as a more complex activation function that can capture long-term dependencies.

The input gate regulates how much of the new cell state to keep, the forget gate regulates how much of the existing memory to forget, and the output gate regulates how much of the cell state should be exposed to the next layers of the network.

The reset gate sits between the previous activation and the next candidate activation to forget previous state, and the update gate decides how much of the candidate activation to use in updating the cell state.

Both LSTMs and GRUs have the ability to keep memory/state from previous activations rather than replacing the entire activation like a vanilla RNN, allowing them to remember features for a long time and allowing backpropagation to happen through multiple bounded nonlinearities, which reduces the likelihood of the vanishing gradient.

The authors will build an LSTM model, a GRU model, and a vanilla RNN model and compare their performances using a log-likelihood loss function over polyphonic music modelling and speech signal modelling datasets.

The authors built models for each of their three test units (LSTM, GRU, and tanh) along the same criteria, and tested their models across four music datasets and two speech datasets.

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far, followed by machine translation and fancy RNN models tackling MT. Key phrases: ...

LSTM Network (Recurrent Neural Network / GRU)

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often ...

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language ...

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

A gentle walk through how they work and how they are useful. Some other helpful resources: RNN and LSTM slides: Luis Serrano's Friendly ..

Lecture 11: Gated Recurrent Units and Further Topics in NMT

Lecture 11 provides a final look at gated recurrent units like GRUs/LSTMs followed by machine translation evaluation, dealing with large vocabulary output, and ...

10.2: Introduction to LSTM and GRU for Deep Learning (Module 10, Part 2)

Introduction to recurrent neural networks LSTM and GRU. This video is part of a course that is taught in a hybrid format at Washington University in St. Louis; ...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other ...

Evolution: from vanilla RNN to GRU & LSTMs (How it works) [En]

This lecture is about most popular RNN cells: - vanilla RNN - GRU - LSTM cell - LSTM with peephole connections. Intuition, what's inside, how it works, ...

Deep Learning with Tensorflow - The Long Short Term Memory Model

Enroll in the course for free at:

Deep Learning with TensorFlow: Introduction. The majority of data ..