# AI News, rnn: recurrent neural networks

- On Saturday, March 10, 2018

## rnn: recurrent neural networks

Note: this repository is deprecated in favor of https://github.com/torch/rnn.

This library includes documentation for the following objects:

- Modules that consider successive calls to forward as different time-steps in a sequence
- Modules that forward entire sequences through a decorated AbstractRecurrent instance
- Miscellaneous modules and criterions
- Criterions used for handling sequential inputs and targets

To install this repository, note that luarocks install rnn now installs https://github.com/torch/rnn instead.

The following are example training scripts using this package : If you use rnn in your work, we'd really appreciate it if you could cite the following paper: Léonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim.

Most issues can be resolved by updating the various dependencies: If you are using CUDA: And don't forget to update this package: If that doesn't fix it, open an issue on GitHub.

The constructor takes a single argument: Argument rho is the maximum number of steps to backpropagate through time (BPTT). Sub-classes can set this to a large number like 99999 (the default) if they want to backpropagate through the

Calling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors.

In other words, it is possible to separate unrelated sequences with a masked element.
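The padding idea can be illustrated outside of Torch. The following is a minimal Python sketch, not the library's Lua API: the helper names `pad_batch` and `mask_zero` are hypothetical, and left-padding is an assumption made for the illustration.

```python
def pad_batch(sequences, featsize):
    """Left-pad variable-length sequences with zero vectors to a common length.

    Left-padding is an assumption of this sketch, not something the source states.
    """
    seqlen = max(len(seq) for seq in sequences)
    zero = [0.0] * featsize
    return [
        [list(zero) for _ in range(seqlen - len(seq))] + [list(vec) for vec in seq]
        for seq in sequences
    ]


def mask_zero(step_input, step_output):
    """Return an all-zero output when the time-step's input is a zero vector."""
    if all(value == 0 for value in step_input):
        return [0.0] * len(step_output)
    return list(step_output)


# Two sequences of lengths 3 and 1 share one batch after padding.
batch = pad_batch([[[1.0], [2.0], [3.0]], [[4.0]]], featsize=1)
```

A masked (all-zero) time-step produces a zero output, which is what lets it act as a separator between unrelated sequences.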

So for example : The reverse order implements backpropagation through time (BPTT).
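To make the reverse-order requirement concrete, here is a small Python sketch of BPTT for a scalar RNN. It is an illustration under stated assumptions, not the package's implementation: the recurrence `h[t] = tanh(w*h[t-1] + u*x[t])`, the squared-error loss, and all function names are chosen for this example only.

```python
import math


def forward(w, u, xs):
    """Run h[t] = tanh(w*h[t-1] + u*x[t]) and log every hidden state."""
    hs = [0.0]  # hs[0] is the initial hidden state
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, ys):
    """Sum of 0.5*(h[t] - y[t])^2 over all time-steps."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))


def bptt(w, u, xs, ys):
    """Accumulate dLoss/dw and dLoss/du by visiting time-steps in reverse order."""
    hs = forward(w, u, xs)
    dw = du = 0.0
    dh_next = 0.0  # gradient flowing back from step t+1 into h[t]
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_next
        da = dh * (1.0 - hs[t] ** 2)  # derivative of tanh
        dw += da * hs[t - 1]
        du += da * xs[t - 1]
        dh_next = da * w
    return dw, du
```

The backward loop needs the gradient from step t+1 before it can finish step t, which is why the calls must mirror the forward calls in reverse.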

This method brings back all states to the start of the sequence buffers.

In training mode, the network remembers all previous rho (number of time-steps) states.
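One way to picture a bounded history of rho states is a fixed-size buffer. The Python sketch below is only an analogy: the class `StateLog` and its methods are hypothetical names and do not reflect how the Lua package stores its states.

```python
from collections import deque


class StateLog:
    """Keep at most rho recent states, discarding the oldest first."""

    def __init__(self, rho):
        self.states = deque(maxlen=rho)

    def remember(self, state):
        self.states.append(state)

    def forget(self):
        """Drop all remembered states, as at the start of a new sequence."""
        self.states.clear()


log = StateLog(rho=3)
for step in range(1, 6):
    log.remember("state-%d" % step)
```

After five steps with rho set to 3, only the states of steps 3 to 5 remain available for backpropagation.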

The nn.Recurrent(start, input, feedback, [transfer, rho, merge]) constructor takes 6 arguments: An RNN is used to process a sequence of inputs. Each call to forward keeps a log of the intermediate states (the input and many Module.outputs) and

backward must be called in reverse order of the sequence of calls to forward in order

The step attribute is only reset to 1 when a call to the forget method is made.

For a simple, concise example of how to make use of this module, please consult the simple-recurrent-network.lua training script.

is actually the recommended approach as it allows RNNs to be stacked and makes the rnn

The actual implementation corresponds to the following algorithm: where W[s->q] is the weight matrix from s to q, t indexes the time-step, b[1->q]

the input, forget and output gates, as well as the hidden state, are computed in one fell swoop.

This extends the FastLSTM class to enable faster convergence during training by zero-centering the input-to-hidden and hidden-to-hidden transformations.

The hidden-to-hidden transition of each LSTM cell is normalized with a batch normalizing transform in which hd is a vector of (pre)activations to be normalized, and gamma and beta are model parameters that determine the standard deviation and mean of the normalized activation, respectively.

eps is a regularization hyperparameter to keep the division numerically stable and E(hd) and E(σ(hd)) are the estimates of the mean and variance in the mini-batch respectively.
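The transform itself did not survive in the text above. Assuming the standard batch normalization form, which is consistent with the symbols defined here (hd, gamma, beta, eps, the mean estimate E(hd) and the variance estimate E(σ(hd))), it can be written as:

```latex
\mathrm{BN}(h_d;\,\gamma,\,\beta) \;=\; \beta \;+\; \gamma \odot
  \frac{h_d - \mathrm{E}(h_d)}{\sqrt{\mathrm{E}(\sigma(h_d)) + \epsilon}}
```

Here \odot denotes element-wise multiplication and \epsilon corresponds to eps. This is a reconstruction under that assumption rather than a quotation of the original formula.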

The authors recommend initializing gamma to a small value and found 0.1 to be the value that did not cause vanishing gradients.

To turn on batch normalization during training, do: where momentum is the same as gamma in the equation above (defaults to 0.1), eps is defined above, and affine is a boolean whose state determines whether the learnable affine transform is turned off (false) or on (true, the default).

The nn.GRU(inputSize, outputSize [,rho [,p [, mono]]]) constructor takes 3 arguments, like nn.LSTM, or 4 arguments for dropout:

The actual implementation corresponds to the following algorithm: where W[s->q] is the weight matrix from s to q, t indexes the time-step, b[1->q] are the biases leading into q, σ() is Sigmoid, x[t] is the input and s[t] is the output of the module (eq.

examples/s is measured by the training speed at 1 epoch, so, it may have a disk IO bias.

In the benchmark, GRU uses dropout after LookupTable, while BGRU, which stands for Bayesian GRU, uses dropout on inner connections (named as in Ref.

To implement GRU, a simple module is added that cannot be built using only nn modules.

y_i = x_i + b, then negate all components if negate is true.

This is used to implement s[t] = (1-z[t])h[t] + z[t]s[t-1] of the GRU (see Equation (4) above).
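The role of this helper can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the Lua module: the function names `sadd` and `gru_output` are hypothetical, and vectors are plain lists.

```python
def sadd(x, b, negate=False):
    """Compute y_i = x_i + b, then negate every component if negate is true."""
    y = [xi + b for xi in x]
    return [-yi for yi in y] if negate else y


def gru_output(z, h, s_prev):
    """Compute s[t] = (1 - z[t]) * h[t] + z[t] * s[t-1] element-wise."""
    # With b = -1 and negate = True, sadd yields -(z - 1) = 1 - z.
    one_minus_z = sadd(z, -1.0, negate=True)
    return [a * hi + zi * si for a, hi, zi, si in zip(one_minus_z, h, z, s_prev)]
```

Choosing b = -1 with negation is what turns the update gate z[t] into the complementary weight 1 - z[t].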

The nn.MuFuRu(inputSize, outputSize [,ops [,rho]]) constructor takes 2 required arguments, plus optional arguments: The Multi-Function Recurrent Unit generalizes the GRU by allowing weightings of arbitrary composition operators to be learned.

As in the GRU, the reset gate is computed based on the current input and previous hidden state, and used to compute a new feature vector: where W[a->b] denotes the weight matrix from activation a to b, t denotes the time step, b[1->a] is the bias for activation a, and s[t-1]r[t] is the element-wise multiplication of the two vectors.

Unlike in the GRU, rather than computing a single update gate (z[t] in GRU), MuFuRU computes a weighting over an arbitrary number of composition operators.

A composition operator is any differentiable operator which takes two vectors of the same size, the previous hidden state and a new feature vector, and returns a new vector representing the new hidden state.

The GRU implicitly defines two such operations, keep and replace, defined as keep(s[t-1], v[t]) = s[t-1] and replace(s[t-1], v[t]) = v[t].

Ref. A proposes 6 additional operators, which all operate element-wise: The weightings of each operation are computed via a softmax from the current input and previous hidden state, similar to the update gate in the GRU.

The produced hidden state is then the element-wise weighted sum of the output of each operation.

where p[t][j] is the weighting for operation j at time step t, and the sum in Equation 5 is over all operators J.
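The weighted combination can be sketched as follows. This Python example is a simplified illustration: keep and replace come from the text above, the element-wise max operator is included only as an illustrative third operation, and the scores are passed in directly instead of being computed from the input and previous hidden state with learned weights. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Each operator maps (previous state s, new feature v) to a candidate state.
OPS = [
    lambda s, v: s,          # keep
    lambda s, v: v,          # replace
    lambda s, v: max(s, v),  # illustrative extra operator (assumption)
]


def mufuru_combine(s_prev, v, op_scores):
    """Weighted sum of operator outputs for each hidden unit.

    op_scores[i][j] is the unnormalized score of operation j for unit i.
    """
    new_state = []
    for s, v_i, scores in zip(s_prev, v, op_scores):
        weights = softmax(scores)
        new_state.append(sum(p * op(s, v_i) for p, op in zip(weights, OPS)))
    return new_state
```

With only keep and replace in `OPS`, the softmax weights reduce to a GRU-style interpolation between the previous state and the new feature vector.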

I could use two sequencers : Using a Recursor, I make the same model with a single Sequencer : Actually, the Sequencer will wrap any non-AbstractRecurrent module automatically, so

increment the self.step attribute by 1, using a shared parameter clone for

build a Simple RNN for language modeling : Note : We could very well reimplement the LSTM module using the newer

Ref. A: Regularizing RNNs by Stabilizing Activations. This module implements the norm-stabilization criterion: This module regularizes the hidden states of RNNs by minimizing the difference between the L2-norms

The Sequencer requires inputs and outputs to be of shape seqlen x batchsize x featsize :

The opening { and closing } illustrate that the time-steps are elements of a Lua table, although it

The batchsize is 2 as there are two independent sequences: { H, E, L, L, O } and { F, U, Z, Z, Y }. The featsize is 1 as there is only one feature dimension per character and each such character is of size 1. So the input in this case is a table of seqlen time-steps where each time-step is represented by a batchsize x featsize Tensor.
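The layout described above can be reproduced with nested Python lists standing in for the Lua table and Tensors. This is only a shape illustration; the helper name `to_sequencer_input` is hypothetical.

```python
def to_sequencer_input(sequences):
    """Arrange equal-length sequences as seqlen steps of batchsize x featsize.

    Each character is treated as a single feature, so featsize is 1.
    """
    seqlen = len(sequences[0])
    if any(len(seq) != seqlen for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [[[seq[t]] for seq in sequences] for t in range(seqlen)]


steps = to_sequencer_input(["HELLO", "FUZZY"])
```

Here `len(steps)` is the seqlen (5), each step holds batchsize (2) rows, and each row holds featsize (1) values.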

For example, rnn, an instance of nn.AbstractRecurrent, can forward an input sequence one time-step at a time: Equivalently, we can use a Sequencer to forward the entire input sequence at once: We can also forward Tensors instead of Tables: The Sequencer can also take non-recurrent Modules (i.e.

When mode='neither' (the default behavior of the class), the Sequencer will additionally call forget before each call to forward.

values for argument mode are as follows : Calls the decorated AbstractRecurrent module's forget method.

This module is a faster version of nn.Sequencer(nn.FastLSTM(inputsize, outputsize)): Each time-step is computed as follows (same as FastLSTM):

input and seqlen x batchsize x outputsize for the output : Note that if you prefer to transpose the first two dimension (i.e.

is equivalent to calling maskZero(1) on a FastLSTM wrapped by a Sequencer: For maskzero = true, input sequences are expected to be separated by a tensor of zeros for a time step.

The computation of a time-step outlined in SeqLSTM is replaced with the following: The algorithm is outlined in ref. A and benchmarked with state of the art results on the Google billion words dataset in ref.

gates i[t], f[t] and o[t] can be much larger than the actual input x[t] and output r[t].

This module is a faster version of nn.Sequencer(nn.GRU(inputsize, outputsize)) : Usage of SeqGRU differs from GRU in the same manner as SeqLSTM differs from LSTM.

Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order.

bwd rnn defaults to: For each step (in the original sequence), the outputs of both rnns are merged together using the

Such that the merge module is then initialized as : Internally, the BiSequencer is implemented by decorating a structure of modules that makes use

is the minimum requirement, as it would not make sense for the bwd rnn to remember future sequences.

Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order.

latter cannot be used for language modeling because the bwd rnn would be trained to predict the input it had just been fed as input.

The bwd rnn defaults to: While the fwd rnn will output representations for the last N-1 steps, the

missing outputs for each rnn (the first step for the fwd, the last step for the bwd) will

last output elements will be padded with zeros for the missing fwd and bwd rnn outputs, respectively.

For each step (in the original sequence), the outputs of both rnns are merged together using the

differs in that the sequence length is fixed beforehand and the input is repeatedly forwarded through

This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.

The only difference from MaskZero is that it reduces computational cost by varying the batch size, where applicable, when sequences of varying lengths are provided in the input.

The output Tensor will have each row zeroed when the commensurate row of the input is a zero index.
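The zero-index behaviour can be sketched with a plain Python lookup. This is an illustration rather than the Lua module: the function name `lookup_mask_zero` is hypothetical, and the 1-based indexing is an assumption that mirrors Lua conventions.

```python
def lookup_mask_zero(weight, indices):
    """Return one embedding row per index, with a zero row for index 0.

    Non-zero indices are treated as 1-based (an assumption of this sketch).
    """
    outsize = len(weight[0])
    return [
        [0.0] * outsize if index == 0 else list(weight[index - 1])
        for index in indices
    ]


weight = [[0.1, 0.2], [0.3, 0.4]]  # two embeddings of size 2
rows = lookup_mask_zero(weight, [2, 0, 1])
```

Reserving index 0 for padding means padded positions yield zero vectors, which downstream masking can then detect.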

This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.


- On Saturday, March 10, 2018

## Getting started with the Keras functional API

The Keras functional API is the way to go for defining complex models, such as multi-output models, directed acyclic graphs, or models with shared layers.

The main input to the model will be the headline itself, as a sequence of words, but to spice things up, our model will also have an auxiliary input, receiving extra data such as the time of day when the headline was posted, etc.

At this point, we feed into the model our auxiliary input data by concatenating it with the LSTM output. This defines a model with two inputs and two outputs. We compile the model and assign a weight of 0.2 to the auxiliary loss.
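The effect of the 0.2 weight can be shown with plain arithmetic. The sketch below is not Keras code: the function `total_loss`, the output names `main_output` and `aux_output`, the main weight of 1.0, and the loss values are all assumptions made for illustration.

```python
def total_loss(losses, loss_weights):
    """Combine per-output losses into one scalar using per-output weights."""
    return sum(loss_weights[name] * value for name, value in losses.items())


combined = total_loss(
    {"main_output": 0.5, "aux_output": 1.0},   # hypothetical per-output losses
    {"main_output": 1.0, "aux_output": 0.2},   # auxiliary loss weighted by 0.2
)
```

With these numbers the auxiliary output contributes 0.2 of its loss to the total, so it guides training without dominating the main objective.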

a sequence of 280 vectors of size 256, where each dimension in the 256-dimensional vector encodes the presence/absence of a character (out of an alphabet of 256 frequent characters).
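Such an encoding can be sketched in pure Python. The example makes a simplifying assumption: the alphabet is taken to be the code points 0 to 255 rather than 256 frequent characters, and the function name `one_hot_text` is hypothetical.

```python
def one_hot_text(text, seqlen=280, alphabet_size=256):
    """Encode text as seqlen rows, each a one-hot vector over the alphabet.

    Characters outside the alphabet and positions past the end of the text
    are left as all-zero rows.
    """
    matrix = [[0] * alphabet_size for _ in range(seqlen)]
    for position, char in enumerate(text[:seqlen]):
        code = ord(char)
        if code < alphabet_size:
            matrix[position][code] = 1
    return matrix


encoded = one_hot_text("hi")
```

The result has shape 280 x 256 regardless of the text length, which is what allows texts of different lengths to share one input shape.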

To share a layer across different inputs, simply instantiate the layer once, then call it on as many inputs as you want: Let's pause to take a look at how to read the shared layer's output or output shape.

Whenever you are calling a layer on some input, you are creating a new tensor (the output of the layer), and you are adding a 'node' to the layer, linking the input tensor to the output tensor.

The same is true for the properties input_shape and output_shape: as long as the layer has only one node, or as long as all nodes have the same input/output shape, then the notion of 'layer output/input shape' is well defined, and that one shape will be returned by layer.output_shape/layer.input_shape.

But if, for instance, you apply the same Conv2D layer to an input of shape (32, 32, 3), and then to an input of shape (64, 64, 3), the layer will have multiple input/output shapes, and you will have to fetch them by specifying the index of the node they belong to: Code examples are still the best way to get started, so here are a few more.
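Independently of the source's own examples, the per-node shapes in this scenario can be worked out by hand. The Python sketch below does not call Keras: the helper `conv2d_output_shape` is hypothetical, and the choices of 16 filters, a 3x3 kernel, and 'same' padding are assumptions for illustration.

```python
import math


def conv2d_output_shape(input_shape, filters, kernel_size, padding="same", stride=1):
    """Compute the (height, width, channels) output shape of a 2D convolution."""
    height, width, _channels = input_shape
    if padding == "same":
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "valid":
        out_h = (height - kernel_size) // stride + 1
        out_w = (width - kernel_size) // stride + 1
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    return (out_h, out_w, filters)


# One entry per node: the same layer applied to two differently shaped inputs.
node_output_shapes = [
    conv2d_output_shape(shape, filters=16, kernel_size=3)
    for shape in [(32, 32, 3), (64, 64, 3)]
]
```

Because the two nodes produce different shapes, a single "layer output shape" is not well defined, and each shape has to be looked up by node index.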

- On Thursday, September 19, 2019

**Multiple Input RNN with Keras**

An introduction to multiple-input RNNs with Keras and TensorFlow. This is the first in a series of videos I'll make to share some things I've learned about Keras, Google Cloud ML, RNNs, and...

**LSTM input output shape , Ways to improve accuracy of predictions in Keras**

In this tutorial we look at how we decide the input shape and output shape for an LSTM. We also tweak various parameters like Normalization, Activation and the loss function and see their...

**RNN Example in Tensorflow - Deep Learning with Neural Networks 11**

In this deep learning with TensorFlow tutorial, we cover how to implement a Recurrent Neural Network, with an LSTM (long short term memory) cell with the MNIST dataset.

**Recurrent Neural Networks (RNN / LSTM )with Keras - Python**

In this tutorial, we learn about Recurrent Neural Networks (LSTM and RNN). Recurrent Neural Networks or RNNs have been very successful and popular in time series data predictions. There are...

**10.3: Programming LSTM with Keras and TensorFlow (Module 10, Part 3)**

Programming LSTM for Keras and TensorFlow in Python. This includes an example of predicting sunspots. This video is part of a course that is taught in a hybrid format at Washington University...

**Lecture 10 | Recurrent Neural Networks**

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language modeling and image captioning, and how...

**10.1: Time Series Data Encoding for Deep Learning, TensorFlow and Keras (Module 10, Part 1)**

How to represent data for time series neural networks. This includes recurrent neural network (RNN) types of LSTM and GRU. This video is part of a course that is taught in a hybrid format...

**What are Recurrent Neural Networks (RNN) and Long Short Term Memory Networks (LSTM) ?**

Recurrent Neural Networks or RNN have been very popular and effective with time series data. In this tutorial, we learn about RNNs, the Vanishing Gradient problem and the solution to the problem...

**Deep Learning Lecture 13: Applying RNN's to Sentiment Analysis**

We'll practice using recurrent neural networks.

**Time Series Forecasting with LSTM Deep Learning**

A quick tutorial on Time Series Forecasting with Long Short Term Memory Network (LSTM), Deep Learning Techniques. The detailed Jupyter Notebook is available at