AI News, Deep Learning for NLP Best Practices

Deep Learning for NLP Best Practices

While many existing Deep Learning libraries already encode best practices for working with neural networks in general, such as initialization schemes, many other details, particularly task or domain-specific considerations, are left to the practitioner.

While many of these features will be most useful for pushing the state-of-the-art, I hope that wider knowledge of them will lead to stronger evaluations, more meaningful comparison to baselines, and inspiration by shaping our intuition of what works.

I will then outline practices that are relevant for the most common tasks, in particular classification, sequence labelling, natural language generation, and neural machine translation.

The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [44] or part-of-speech (POS) tagging (Plank et al., 2016) [32], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [45].

First let us assume a one-layer MLP, which applies an affine transformation followed by a non-linearity \(g\) to its input \(\mathbf{x}\): \(\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b})\) A

highway layer then computes the following function instead: \(\mathbf{h} = \mathbf{t} \odot g(\mathbf{W} \mathbf{x} + \mathbf{b}) + (1-\mathbf{t}) \odot \mathbf{x} \) where \(\odot\) is elementwise multiplication, \(\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{x} + \mathbf{b}_T)\) is called the transform gate, and \((1-\mathbf{t})\) is called the carry gate.

Residual connections are even more straightforward than highway layers and learn the following function: \(\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b}) + \mathbf{x}\) which simply adds the input of the current layer to its output via a short-cut connection.

Dense connections   Rather than just adding layers from each layer to the next, dense connections (Huang et al., 2017) [7] (best paper award at CVPR 2017) add direct connections from each layer to all subsequent layers.

They have also found to be useful for Multi-Task Learning of different NLP tasks (Ruder et al., 2017) [49], while a residual variant that uses summation has been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [27].

While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivasta et al., 2014) [8] is still the go-to regularizer for deep neural networks in NLP.

Recurrent dropout has been used for instance to achieve state-of-the-art results in semantic role labelling (He et al., 2017) and language modelling (Melis et al., 2017) [34].

While we can already predict surrounding words in order to pre-train word embeddings (Mikolov et al., 2013), we can also use this as an auxiliary objective during training (Rei, 2017) [35].

Using attention, we obtain a context vector \(\mathbf{c}_i\) based on hidden states \(\mathbf{s}_1, \ldots, \mathbf{s}_m\) that can be used together with the current hidden state \(\mathbf{h}_i\) for prediction.

The context vector \(\mathbf{c}_i\) at position is calculated as an average of the previous states weighted with the attention scores \(\mathbf{a}_i\): \(\begin{align}\begin{split}\mathbf{c}_i = \sum\limits_j a_{ij}\mathbf{s}_j\\ \mathbf{a}_i

Additive attention   The original attention mechanism (Bahdanau et al., 2015) [15] uses a one-hidden layer feed-forward network to calculate the attention alignment: \(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a[\mathbf{h}_i;

Analogously, we can also use matrices \(\mathbf{W}_1\) and \(\mathbf{W}_2\) to learn separate transformations for \(\mathbf{h}_i\) and \(\mathbf{s}_j\) respectively, which are then summed: \(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j) \) Multiplicative attention   Multiplicative attention (Luong et al., 2015) [16] simplifies the attention operation by calculating the following function: \(f_{att}(h_i, s_j) = h_i^\top \mathbf{W}_a s_j \) Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication.

Attention cannot only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text as used for reading comprehension (Kadlec et al., 2017) [37].

Self-attention   Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [18].

Self-attention, also called intra-attention has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].

We can simplify additive attention to compute the unnormalized alignment score for each hidden state \(\mathbf{h}_i\): \(f_{att}(\mathbf{h}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a \mathbf{h}_i) \) In matrix form, for hidden states \(\mathbf{H} = \mathbf{h}_1, \ldots, \mathbf{h}_n\) we can calculate the attention vector \(\mathbf{a}\) and the final sentence representation \(\mathbf{c}\) as follows: \(\begin{align}\begin{split}\mathbf{a} = \text{softmax}(\mathbf{v}_a \text{tanh}(\mathbf{W}_a \mathbf{H}^\top))\\ \mathbf{c}

In practice, we enforce the following orthogonality constraint to penalize redundancy and encourage diversity in the attention vectors in the form of the squared Frobenius norm: \(\Omega = \|(\mathbf{A}\mathbf{A}^\top - \mathbf{I} \|^2_F \) A

Key-value attention   Finally, key-value attention (Daniluk et al., 2017) [19] is a recent attention variant that separates form from function by keeping separate vectors for the attention calculation.

While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015;

Recent advances in Bayesian Optimization have made it an ideal tool for the black-box optimization of hyperparameters in neural networks (Snoek et al., 2012) [56] and far more efficient than the widely used grid search.

Rather than clipping each gradient independently, clipping the global norm of the gradient (Pascanu et al., 2013) [58] yields more significant improvements (a Tensorflow implementation can be found here).

While many of the existing best practices are with regard to a particular part of the model architecture, the following guidelines discuss choices for the model's output and prediction stage.

Using IOBES and BIO yield similar performance (Lample et al., 2017) CRF output layer   If there are any dependencies between outputs, such as in named entity recognition the final softmax layer can be replaced with a linear-chain conditional random field (CRF).

If attention is used, we can keep track of a coverage vector \(\mathbf{c}_i\), which is the sum of attention distributions \(\mathbf{a}_t\) over previous time steps (Tu et al., 2016;

See et al., 2017) [64, 65]: \(\mathbf{c}_i = \sum\limits^{i-1}_{t=1} \mathbf{a}_t \) This vector captures how much attention we have paid to all words in the source.

We can now condition additive attention additionally on this coverage vector in order to encourage our model not to attend to the same words repeatedly: \(f_{att}(\mathbf{h}_i,\mathbf{s}_j,\mathbf{c}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j + \mathbf{W}_3 \mathbf{c}_i )\) In addition, we can add an auxiliary loss that captures the task-specific attention behaviour that we would like to elicit: For NMT, we would like to have a roughly one-to-one alignment;

Beam search strategy   Medium beam sizes around \(10\) with length normalization penalty of \(1.0\) (Wu et al., 2016) yield the best performance (Britz et al., 2017).

BPE iteratively merges frequent symbol pairs, which eventually results in frequent character n-grams being merged into a single symbol, thereby effectively eliminating out-of-vocabulary-words.

While it was originally meant to handle rare words, a model with sub-word units outperforms full-word systems across the board, with 32,000 being an effective vocabulary size for sub-word units (Denkowski

Connectome-based predictive modeling of attention: Comparing different functional connectivity features and prediction methods across datasets.

Shen et al., 2017) was recently developed to predict individual differences in traits and behaviors, including fluid intelligence (Finn et al., 2015) and sustained attention (Rosenberg et al., 2016a), from functional brain connectivity (FC) measured with fMRI.

We defined connectome-based models using task-based or resting-state FC data, and tested the effects of (1) functional connectivity measure and (2) feature-selection/prediction algorithm on individualized attention predictions.

Models defined using all combinations of functional connectivity measure (Pearson's correlation, accordance, and discordance) and prediction algorithm (linear and PLS regression) predicted attentional abilities, with correlations between predicted and observed measures of attention as high as 0.9 for internal validation, and 0.6 for external validation (all p's < 0.05).

Lecture 10: Neural Machine Translation and Models with Attention

Lecture 10 introduces translation, machine translation, and neural machine translation. Google's new NMT is highlighted followed by sequence models with attention as well as sequence model...

Bruno Mars - 24K Magic [Victoria’s Secret 2016 Fashion Show Performance]

Get the new album '24K Magic' out now: See Bruno on the '24K Magic World Tour'! Tickets on sale now. Visit for dates Stream '24K Magic'.

Deep Learning and Language Model - Part-2

This tutorial Explains the Encoder-Decoder RNN and the Language Model with Encoder-Decoder RNN. References used: 1. Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. « Describing Multimedia...

PR-049: Attention is All You Need

제가 리뷰한 논문은 Attention is All You Need 입니다. CNN이나 RNN을 쓰지 않고 attention만을 써서 만든 네트워크 이며 기계번역에서 성능, 연산속도 향상을...

Bruno Mars - Chunky [Victoria’s Secret 2016 Fashion Show Performance]

Get the new album '24K Magic' out now: See Bruno on the '24K Magic World Tour'! Tickets on sale now. Visit for dates Stream '24K Magic'.



Demi Lovato: Simply Complicated - Official Documentary

Watch never before seen footage in the Simply Complicated Director's Cut Demi Lovato: Simply Complicated is a full length documentary that gives..

State of Alert Israel style - (vpro backlight documentary - 2017)

'When it comes to security, Europe is still in a state of denial' – (Michal Marmary, Homeland Security Tel Aviv) Recent suicide attacks in Paris, Brussels, Nice and Berlin have greatly...

VEGAN 2017 - The Film

Plant Based News' end of year film Vegan 2017 is here. ☆ ENJOY THE FILM? PLEASE SUPPORT US SO WE CAN MAKE MORE OF THEM: ☆ Watch Vegan 2016:

Densely Connected Convolutional Networks

Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they...