Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients

We will then try to understand the vanishing gradient problem, which has led to the development of LSTMs and GRUs, two of the most popular and powerful models currently used in NLP (and other areas).

The vanishing gradient problem was originally discovered by Sepp Hochreiter in 1991 and has been receiving attention again recently due to the increased application of deep architectures.

Here, y_t is the correct word at time step t, and ŷ_t is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).
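
For reference, one way to write this out, assuming the cross-entropy loss used in the earlier parts of this series (y_t is a one-hot vector, so only the predicted probability of the correct word contributes):

```latex
E_t(y_t, \hat{y}_t) = -\,y_t \log \hat{y}_t
\qquad\qquad
E(y, \hat{y}) = \sum_{t} E_t(y_t, \hat{y}_t) = -\sum_{t} y_t \log \hat{y}_t
```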

Remember that our goal is to calculate the gradients of the error with respect to our parameters U, V, and W, and then learn good parameters using Stochastic Gradient Descent.
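
The SGD part is the easy bit. A minimal sketch in numpy (the dict layout, parameter names, and the 0.005 learning rate here are illustrative choices, not the tutorial's actual code); the gradients themselves come from the BPTT procedure sketched further below:

```python
import numpy as np

def sgd_step(params, grads, learning_rate=0.005):
    """One SGD update: nudge each parameter against its gradient.

    `params` and `grads` are dicts of numpy arrays holding U, V, W and
    dE/dU, dE/dV, dE/dW for a single training example (one sentence).
    """
    for name in ('U', 'V', 'W'):
        params[name] -= learning_rate * grads[name]
```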

Don’t worry if you don’t follow the above; I skipped several steps, and you can try calculating these derivatives yourself (a good exercise!).

In code, a naive implementation of BPTT looks something like the sketch below. This should also give you an idea of why standard RNNs are hard to train: sequences (sentences) can be quite long, perhaps 20 words or more, so you need to back-propagate through many layers.
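
The following is a reconstruction of what such a naive (untruncated) BPTT looks like, not the tutorial's original code: the function name, shapes, and helper are my own, and it assumes the model from the earlier parts of this series, s_t = tanh(U x_t + W s_{t−1}), o_t = softmax(V s_t), with one-hot word inputs:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bptt_naive(x, y, U, V, W):
    """Naive backpropagation through time for a vanilla RNN.

    x, y are lists of input/target word indices (inputs are one-hot, so
    U x_t is just the column U[:, x_t]).  Returns the gradients of the
    total cross-entropy loss with respect to U, V and W.
    """
    T = len(y)
    hidden_dim, word_dim = U.shape
    # Forward pass: keep every hidden state and output, we need them later.
    s = np.zeros((T + 1, hidden_dim))      # s[-1] is the initial (zero) state
    o = np.zeros((T, word_dim))
    for t in range(T):
        s[t] = np.tanh(U[:, x[t]] + W.dot(s[t - 1]))
        o[t] = softmax(V.dot(s[t]))
    # Backward pass: for each output, walk all the way back to t = 0.
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    delta_o = o.copy()
    delta_o[np.arange(T), y] -= 1.0        # gradient of softmax + cross-entropy
    for t in reversed(range(T)):
        dV += np.outer(delta_o[t], s[t])
        delta_t = V.T.dot(delta_o[t]) * (1 - s[t] ** 2)   # back through tanh
        for step in reversed(range(t + 1)):               # back through time
            dW += np.outer(delta_t, s[step - 1])
            dU[:, x[step]] += delta_t
            delta_t = W.T.dot(delta_t) * (1 - s[step - 1] ** 2)
    return dU, dV, dW
```

Note the nested loop: the error at every output is pushed back through all earlier time steps, which is exactly the long chain of layers mentioned above.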

Also note that because we are taking the derivative of a vector function with respect to a vector, the result is a matrix (called the Jacobian matrix) whose elements are all the pointwise derivatives.

It turns out (I won’t prove it here, but this paper goes into detail) that the 2-norm, which you can think of as an absolute value, of the above Jacobian matrix has an upper bound of 1.

This makes intuitive sense because our tanh (or sigmoid) activation function maps all values into a range between -1 and 1, and its derivative is bounded by 1 (1/4 in the case of the sigmoid) as well. You can see that the tanh and sigmoid functions have derivatives of 0 at both ends, where their curves flatten out.
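
A quick numerical check of those bounds (just an illustration, not part of the original post):

```python
import numpy as np

x = np.linspace(-6, 6, 1001)

# Derivatives of the two activations:
#   d/dx tanh(x)    = 1 - tanh(x)^2      (at most 1, approaches 0 for large |x|)
#   d/dx sigmoid(x) = s(x) * (1 - s(x))  (at most 1/4, approaches 0 for large |x|)
dtanh = 1 - np.tanh(x) ** 2
sig = 1 / (1 + np.exp(-x))
dsig = sig * (1 - sig)

print(dtanh.max())          # ~1.0, reached at x = 0
print(dsig.max())           # ~0.25, reached at x = 0
print(dtanh[0], dtanh[-1])  # ~0 at both ends
```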

Thus, with small values in the matrix and multiple matrix multiplications (one for each step back in time), the gradient values shrink exponentially fast, eventually vanishing completely after a few time steps.
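
Written out in the hidden-state notation s_t used earlier in this series (a sketch of the argument, not a proof), the troublesome term is a product of Jacobians, and a uniform bound γ on their 2-norms turns into an exponential factor:

```latex
\frac{\partial s_t}{\partial s_k}
  \;=\; \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}},
\qquad
\left\lVert \frac{\partial s_t}{\partial s_k} \right\rVert_2
  \;\le\; \prod_{j=k+1}^{t} \left\lVert \frac{\partial s_j}{\partial s_{j-1}} \right\rVert_2
  \;\le\; \gamma^{\,t-k}.
```

With γ < 1 this shrinks exponentially in the distance t − k, which is the vanishing-gradient effect; with γ > 1 the same product can instead blow up.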

It is easy to imagine that, depending on our activation functions and network parameters, we could get exploding instead of vanishing gradients if the values of the Jacobian matrix are large.

GRUs, first proposed in 2014, are simplified versions of LSTMs. Both of these RNN architectures were explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies.

How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network?

Generally, an RNN's transition function is an affine transformation followed by a pointwise non-linearity, e.g. the hyperbolic tangent: h(t) = tanh(W x(t) + U h(t−1) + b). The gradients at earlier layers (time steps) are therefore calculated as a product of differentials.

The LSTM instead maintains a memory cell, c(t) = i(t) • u(t) + f(t) • c(t−1), followed by the hidden layer, h(t) = o(t) • tanh(c(t)), where i is the input gate, u the update (candidate) value, f the forget gate, o the output gate, c(t−1) the previous cell value, and • denotes element-wise multiplication. Because the cell state is updated additively rather than repeatedly squashed through a non-linearity, gradients can flow through c across many time steps without vanishing.
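
To make the gate equations concrete, here is a minimal single-step LSTM cell in numpy, written to mirror the equations above; the weight layout (dicts keyed by gate name) and function name are illustrative choices, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step, mirroring the equations above.

    W, U, b hold the parameters of the four transforms, keyed by
    'i' (input gate), 'f' (forget gate), 'o' (output gate), 'u' (update/candidate).
    """
    i = sigmoid(W['i'].dot(x_t) + U['i'].dot(h_prev) + b['i'])   # input gate
    f = sigmoid(W['f'].dot(x_t) + U['f'].dot(h_prev) + b['f'])   # forget gate
    o = sigmoid(W['o'].dot(x_t) + U['o'].dot(h_prev) + b['o'])   # output gate
    u = np.tanh(W['u'].dot(x_t) + U['u'].dot(h_prev) + b['u'])   # candidate update
    # Memory cell: an additive update, so gradients can flow through c
    # across many time steps when the forget gate f stays close to 1.
    c = i * u + f * c_prev
    h = o * np.tanh(c)
    return h, c
```

Contrast this with the vanilla transition h(t) = tanh(W x(t) + U h(t−1) + b): there, every step squashes the entire state through tanh, whereas here the cell state c(t) is carried forward largely additively, gated by f and i.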

Vanishing / Exploding Gradients

This video is part of the Udacity course "Deep Learning". Watch the full course at

Recurrent Neural Networks - Ep. 9 (Deep Learning SIMPLIFIED)

Our previous discussions of deep net applications were limited to static patterns, but how can a net decipher and label patterns that change with time? For example, could a net be used to scan...

4. Why it is Difficult to Train an RNN

Video from Coursera - University of Toronto - Course: Neural Networks for Machine Learning:

Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorflow Tutorial | Edureka

This Edureka Recurrent Neural Networks tutorial video will help you understand recurrent neural networks.

What are Recurrent Neural Networks (RNN) and Long Short Term Memory Networks (LSTM)?

Recurrent Neural Networks, or RNNs, have been very popular and effective with time series data. In this tutorial, we learn about RNNs, the Vanishing Gradient problem and the solution to the problem...

Neural Networks Demystified [Part 4: Backpropagation]

Backpropagation as simple as possible, but no simpler. Perhaps the most misunderstood part of neural networks, backpropagation of errors is the key step that allows ANNs to learn. In this video,...

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language modeling and image captioning, and how...

How to Generate Music - Intro to Deep Learning #9

We're going to build a music generating neural network trained on jazz songs in Keras. I'll go over the history of algorithmic generation, then we'll walk step by step through the process of...

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far, followed by machine translation and fancy RNN models tackling MT. Key phrases: Language Models. RNN. Bi-directional...

Deep Learning with Tensorflow - The Long Short Term Memory Model

Enroll in the course for free at: Deep Learning with TensorFlow - Introduction. The majority of data in the world is unlabeled.