AI News, Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras

Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras

Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence.

What makes this problem difficult is that the sequences can vary in length, can be composed of a very large vocabulary of input symbols, and may require the model to learn the long-term context or dependencies between symbols in the input sequence.

In this post, you will discover how you can develop LSTM recurrent neural network models for sequence classification problems in Python using the Keras deep learning library.

The IMDB movie review data was collected by Stanford researchers and used in a 2011 paper, with an even 50-50 split of the data for training and testing.

We will map each movie review into a real vector domain, a popular technique when working with text called word embedding.

This is a technique where words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Finally, the sequence length (number of words) in each review varies, so we will constrain each review to be 500 words, truncating longer reviews and padding shorter reviews with zero values.
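As a sketch of this step, Keras provides a pad_sequences utility for exactly this truncate-and-pad operation (the vocabulary size of 5,000 below is an illustrative choice, and import paths vary slightly across Keras versions):

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_review_length = 500  # cap every review at 500 word indices

# load_data returns reviews as lists of integer word indices
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

# longer reviews are truncated, shorter ones are padded with zeros
X_train = pad_sequences(X_train, maxlen=max_review_length)
X_test = pad_sequences(X_test, maxlen=max_review_length)
```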

Let’s start off by importing the classes and functions required for this model and initializing the random number generator to a constant value to ensure we can easily reproduce the results.
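A minimal sketch of that setup, assuming the TensorFlow-bundled Keras (import paths vary slightly across Keras versions); the specific seed value is arbitrary:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding

# fix the random seeds so repeated runs produce comparable results
np.random.seed(7)
tf.random.set_seed(7)
```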

The model will learn that the zero values carry no information, so while the sequences are not the same length in terms of content, Keras requires same-length vectors to perform the computation.

Finally, because this is a binary classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.
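Putting these pieces together, a model along these lines might look like the following sketch; the vocabulary size, embedding dimensionality, and number of LSTM units are illustrative choices rather than values prescribed by the text:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

top_words = 5000      # vocabulary size (illustrative)
embedding_dim = 32    # dimensionality of the learned word vectors (illustrative)

model = Sequential()
model.add(Embedding(top_words, embedding_dim))   # word-embedding layer
model.add(LSTM(100))                             # 100 LSTM units (illustrative)
model.add(Dense(1, activation='sigmoid'))        # single sigmoid unit for good/bad
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# training would then call model.fit(...) on the padded sequences prepared above
```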

For example, we can modify the first example to add dropout to the input and recurrent connections of the LSTM layer itself; a sketch of this more precise LSTM dropout is given below for completeness.
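In current Keras, the gate-specific variant corresponds to the dropout and recurrent_dropout arguments of the LSTM layer; a hedged sketch of both styles (dropout rates are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential()
model.add(Embedding(5000, 32))

# Variant 1: separate Dropout layers before and after the LSTM
# model.add(Dropout(0.2))
# model.add(LSTM(100))
# model.add(Dropout(0.2))

# Variant 2: dropout on the LSTM's input and recurrent connections
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```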

Dropout is a powerful technique for combating overfitting in your LSTM models and it is a good idea to try both methods, but you may get better results with the gate-specific dropout provided in Keras.

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and a convolutional neural network (CNN) may be able to pick out invariant features for good and bad sentiment.
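One way to exploit that structure is to place a one-dimensional convolution and pooling layer in front of the LSTM; a sketch with illustrative layer sizes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential()
model.add(Embedding(5000, 32))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))  # halves the sequence length
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```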

Getting started with the Keras Sequential model

You can create a Sequential model by passing a list of layer instances to the constructor, or simply add layers via the .add() method; both styles are sketched below. The model needs to know what input shape it should expect.
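For illustration, the two construction styles look like this (layer sizes follow the usual examples in the Keras guide):

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# passing a list of layer instances to the constructor
model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])

# equivalently, adding layers one at a time via .add()
model = Sequential()
model.add(Dense(32, input_shape=(784,)))
model.add(Activation('relu'))
```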

For this reason, the first layer in a Sequential model (and only the first, because following layers can do automatic shape inference) needs to receive information about its input shape.

There are several possible ways to do this, and the snippets below are strictly equivalent. Before training a model, you need to configure the learning process, which is done via the compile method.
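Concretely, declaring the input shape via input_shape or input_dim is equivalent for a Dense first layer, and compile then sets the optimizer, loss, and metrics (the particular choices below are just an example configuration):

```python
from keras.models import Sequential
from keras.layers import Dense

# strictly equivalent ways of declaring the expected input shape
model = Sequential()
model.add(Dense(32, input_shape=(784,)))

model = Sequential()
model.add(Dense(32, input_dim=784))

# configure the learning process before training
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```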

Deep Learning for NLP Best Practices

While many existing Deep Learning libraries already encode best practices for working with neural networks in general, such as initialization schemes, many other details, particularly task or domain-specific considerations, are left to the practitioner.

While many of these features will be most useful for pushing the state-of-the-art, I hope that wider knowledge of them will lead to stronger evaluations, more meaningful comparison to baselines, and inspiration by shaping our intuition of what works.

I will then outline practices that are relevant for the most common tasks, in particular classification, sequence labelling, natural language generation, and neural machine translation.

The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [44] or part-of-speech (POS) tagging (Plank et al., 2016) [32], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [45].

First let us assume a one-layer MLP, which applies an affine transformation followed by a non-linearity \(g\) to its input \(\mathbf{x}\): \(\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b})\)

A highway layer then computes the following function instead: \(\mathbf{h} = \mathbf{t} \odot g(\mathbf{W} \mathbf{x} + \mathbf{b}) + (1-\mathbf{t}) \odot \mathbf{x} \) where \(\odot\) is elementwise multiplication, \(\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{x} + \mathbf{b}_T)\) is called the transform gate, and \((1-\mathbf{t})\) is called the carry gate.
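A minimal NumPy sketch of a single highway layer following the formula above; the weights are random placeholders, and the input and output dimensionality must match:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W, b, W_T, b_T, g=np.tanh):
    """h = t * g(Wx + b) + (1 - t) * x, with transform gate t = sigmoid(W_T x + b_T)."""
    t = sigmoid(W_T @ x + b_T)              # transform gate
    return t * g(W @ x + b) + (1 - t) * x   # (1 - t) acts as the carry gate

d = 8
x = np.random.randn(d)
W, W_T = np.random.randn(d, d), np.random.randn(d, d)
b, b_T = np.zeros(d), np.zeros(d)
h = highway_layer(x, W, b, W_T, b_T)
```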

Residual connections are even more straightforward than highway layers and learn the following function: \(\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b}) + \mathbf{x}\) which simply adds the input of the current layer to its output via a short-cut connection.
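Under the same assumptions, the residual version is a one-line change: the layer's input is added back to its output.

```python
import numpy as np

def residual_layer(x, W, b, g=np.tanh):
    """h = g(Wx + b) + x: the input is carried over via a shortcut connection."""
    return g(W @ x + b) + x
```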

Dense connections   Rather than just adding connections from each layer to the next, dense connections (Huang et al., 2017) [7] (best paper award at CVPR 2017) add direct connections from each layer to all subsequent layers.

They have also been found to be useful for Multi-Task Learning of different NLP tasks (Ruder et al., 2017) [49], while a residual variant that uses summation has been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [27].

While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivastava et al., 2014) [8] is still the go-to regularizer for deep neural networks in NLP.

Recurrent dropout has been used for instance to achieve state-of-the-art results in semantic role labelling (He et al., 2017) and language modelling (Melis et al., 2017) [34].

While we can already predict surrounding words in order to pre-train word embeddings (Mikolov et al., 2013), we can also use this as an auxiliary objective during training (Rei, 2017) [35].

Using attention, we obtain a context vector \(\mathbf{c}_i\) based on hidden states \(\mathbf{s}_1, \ldots, \mathbf{s}_m\) that can be used together with the current hidden state \(\mathbf{h}_i\) for prediction.

The context vector \(\mathbf{c}_i\) at position \(i\) is calculated as an average of the previous states weighted with the attention scores \(\mathbf{a}_i\): \(\begin{align}\begin{split}\mathbf{c}_i = \sum\limits_j a_{ij}\mathbf{s}_j\\ \mathbf{a}_i = \text{softmax}(f_{att}(\mathbf{h}_i, \mathbf{s}_j))\end{split}\end{align}\) where \(f_{att}(\mathbf{h}_i, \mathbf{s}_j)\) is an alignment function that scores how well the current hidden state matches each state \(\mathbf{s}_j\).

Additive attention   The original attention mechanism (Bahdanau et al., 2015) [15] uses a one-hidden-layer feed-forward network to calculate the attention alignment: \(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a[\mathbf{h}_i; \mathbf{s}_j])\) where \([\mathbf{h}_i; \mathbf{s}_j]\) denotes the concatenation of the two vectors.

Analogously, we can also use matrices \(\mathbf{W}_1\) and \(\mathbf{W}_2\) to learn separate transformations for \(\mathbf{h}_i\) and \(\mathbf{s}_j\) respectively, which are then summed: \(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j) \)

Multiplicative attention   Multiplicative attention (Luong et al., 2015) [16] simplifies the attention operation by calculating the following function: \(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{h}_i^\top \mathbf{W}_a \mathbf{s}_j \) Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice, as it can be implemented more efficiently using matrix multiplication.
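A NumPy sketch contrasting the two scoring functions for a single state \(\mathbf{h}_i\) and a set of states \(\mathbf{s}_1, \ldots, \mathbf{s}_m\); all weights are random placeholders:

```python
import numpy as np

d = 16          # hidden size
m = 10          # number of states being attended over
h_i = np.random.randn(d)
S = np.random.randn(m, d)            # rows are the states s_1 ... s_m

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive (Bahdanau) attention: v_a^T tanh(W_1 h_i + W_2 s_j)
W1, W2, v_a = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d)
scores_add = np.array([v_a @ np.tanh(W1 @ h_i + W2 @ s_j) for s_j in S])

# Multiplicative (Luong) attention: h_i^T W_a s_j, a single matrix product
W_a = np.random.randn(d, d)
scores_mul = S @ (W_a.T @ h_i)       # element j equals h_i^T W_a s_j

# Attention weights and context vector, as in the formulas above
a = softmax(scores_mul)
c = a @ S                            # c_i = sum_j a_ij s_j
```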

Attention can not only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text, as used for reading comprehension (Kadlec et al., 2017) [37].

Self-attention   Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [18].

Self-attention, also called intra-attention, has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].

We can simplify additive attention to compute the unnormalized alignment score for each hidden state \(\mathbf{h}_i\): \(f_{att}(\mathbf{h}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a \mathbf{h}_i) \) In matrix form, for hidden states \(\mathbf{H} = \mathbf{h}_1, \ldots, \mathbf{h}_n\) we can calculate the attention vector \(\mathbf{a}\) and the final sentence representation \(\mathbf{c}\) as follows: \(\begin{align}\begin{split}\mathbf{a} = \text{softmax}(\mathbf{v}_a \text{tanh}(\mathbf{W}_a \mathbf{H}^\top))\\ \mathbf{c} = \mathbf{H}\mathbf{a}^\top\end{split}\end{align}\)
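A matrix-form NumPy sketch of this simplified self-attention; the dimensions and weights are placeholders:

```python
import numpy as np

n, d, k = 12, 16, 8                  # sentence length, hidden size, attention size
H = np.random.randn(n, d)            # hidden states h_1 ... h_n as rows

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_a = np.random.randn(k, d)
v_a = np.random.randn(k)

# a = softmax(v_a tanh(W_a H^T)): one normalized weight per hidden state
a = softmax(v_a @ np.tanh(W_a @ H.T))   # shape (n,)

# c: weighted sum of hidden states -> a fixed-size sentence representation
c = a @ H                               # shape (d,)
```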

In practice, we enforce the following orthogonality constraint to penalize redundancy and encourage diversity among the attention vectors, in the form of the squared Frobenius norm: \(\Omega = \| \mathbf{A}\mathbf{A}^\top - \mathbf{I} \|^2_F \) where \(\mathbf{A}\) is the matrix whose rows are the attention vectors.

Key-value attention   Finally, key-value attention (Daniluk et al., 2017) [19] is a recent attention variant that separates form from function by keeping separate vectors for the attention calculation.

While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015).

Recent advances in Bayesian Optimization have made it an ideal tool for the black-box optimization of hyperparameters in neural networks (Snoek et al., 2012) [56] and far more efficient than the widely used grid search.

Rather than clipping each gradient independently, clipping the global norm of the gradient (Pascanu et al., 2013) [58] yields more significant improvements (TensorFlow provides this as tf.clip_by_global_norm).
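A hedged sketch of global-norm clipping inside a custom TensorFlow training step; the clipping threshold of 5.0 is an illustrative value:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(model, x, y, loss_fn, clip_norm=5.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # clip the *global* norm of all gradients jointly, not each one separately
    grads, _ = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```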

While many of the existing best practices are with regard to a particular part of the model architecture, the following guidelines discuss choices for the model's output and prediction stage.

Using IOBES and BIO yields similar performance (Lample et al., 2017).

CRF output layer   If there are any dependencies between outputs, such as in named entity recognition, the final softmax layer can be replaced with a linear-chain conditional random field (CRF).

If attention is used, we can keep track of a coverage vector \(\mathbf{c}_i\), which is the sum of attention distributions \(\mathbf{a}_t\) over previous time steps (Tu et al., 2016; See et al., 2017) [64, 65]: \(\mathbf{c}_i = \sum\limits^{i-1}_{t=1} \mathbf{a}_t \) This vector captures how much attention we have paid to all words in the source.

We can now condition additive attention additionally on this coverage vector in order to encourage our model not to attend to the same words repeatedly: \(f_{att}(\mathbf{h}_i,\mathbf{s}_j,\mathbf{c}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j + \mathbf{W}_3 \mathbf{c}_i )\) In addition, we can add an auxiliary loss that captures the task-specific attention behaviour we would like to elicit: for NMT, we would like the attention weights to form a roughly one-to-one alignment between source and target words.

Beam search strategy   Medium beam sizes of around \(10\) with a length normalization penalty of \(1.0\) (Wu et al., 2016) yield the best performance (Britz et al., 2017).
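As a toy illustration of the mechanism (not tied to any particular NMT system), the sketch below runs beam search over a hypothetical step_log_probs callback and scores finished hypotheses with the Wu et al. (2016) length penalty \(lp(Y) = ((5 + |Y|)/6)^\alpha\):

```python
import heapq
import math

def beam_search(step_log_probs, beam_size=10, max_len=20, alpha=1.0, eos=0):
    """Toy beam search; step_log_probs(prefix) -> {token: log_prob} is hypothetical."""
    beams = [(0.0, [])]                     # (cumulative log-prob, token prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for tok, tok_logp in step_log_probs(prefix).items():
                candidates.append((logp + tok_logp, prefix + [tok]))
        # keep the beam_size best partial hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        still_open = []
        for logp, prefix in beams:
            if prefix[-1] == eos:
                lp = ((5 + len(prefix)) / 6) ** alpha   # length normalization
                finished.append((logp / lp, prefix))
            else:
                still_open.append((logp, prefix))
        beams = still_open
        if not beams:
            break
    best = finished if finished else beams
    return max(best, key=lambda c: c[0])

# tiny usage with a fake 3-token vocabulary where token 0 is <eos>
fake_model = lambda prefix: {0: math.log(0.2), 1: math.log(0.5), 2: math.log(0.3)}
score, tokens = beam_search(fake_model, beam_size=3, max_len=5)
```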

BPE iteratively merges frequent symbol pairs, which eventually results in frequent character n-grams being merged into a single symbol, thereby effectively eliminating out-of-vocabulary-words.

While it was originally meant to handle rare words, a model with sub-word units outperforms full-word systems across the board, with 32,000 being an effective vocabulary size for sub-word units (Denkowski & Neubig, 2017).
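A toy sketch of the core BPE merge loop on a tiny illustrative corpus (following the standard algorithm; the corpus and number of merges are made up for demonstration):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen symbol pair into a single symbol in every word."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# words represented as sequences of characters plus an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):                       # number of merges controls the vocabulary budget
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent symbol pair
    vocab = merge_pair(best, vocab)

print(vocab)  # frequent character n-grams such as 'est</w>' end up as single symbols
```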

How to Do Sentiment Analysis - Intro to Deep Learning #3

In this video, we'll use machine learning to help classify emotions! The example we'll use is classifying a movie review as either positive or negative via TF Learn in 20 lines of Python. ...

Deep Learning Lecture 13: Applying RNN's to Sentiment Analysis

We'll practice using recurrent neural networks for sentiment analysis.

Lesson 5: Practical Deep Learning for Coders

INTRO TO NLP AND RNNS We start by combining everything we've learned so far to see what that buys us, and we discover that we get a Kaggle-winning result! One important point: in this lesson...

Lecture 15: Coreference Resolution

Lecture 15 covers what coreference is, via a working example. Also includes the research highlight "Summarizing Source Code", an introduction to coreference resolution, and neural coreference resolution...

Lesson 6: Practical Deep Learning for Coders

BUILDING RNNS This lesson starts by introducing a new tool, the MixIterator, which will (finally!) allow us to fully implement the pseudo-labeling technique we learnt a couple of lessons ago....

Lecture 18: Tackling the Limits of Deep Learning for NLP

Lecture 18 looks at tackling the limits of deep learning for NLP, followed by a few presentations.