AI News, Deep Learning Research Review Week 3: Natural Language Processing

Deep Learning Research Review Week 3: Natural Language Processing

This post is structured so that we first work through the basic building blocks of deep networks for NLP, and then discuss some applications through recent research papers.

It’s normal not to know exactly why we’re using RNNs or why an LSTM is helpful, but hopefully by the end of the research papers you’ll have a better sense of why deep learning techniques have helped NLP so much.

A co-occurrence matrix is a matrix that contains the number of times each word appears next to every other word in the corpus (or training set).
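To make this concrete, here is a minimal sketch (illustrative code, not from the original post) that builds such a matrix for a toy three-sentence corpus with a context window of one word on each side:

```python
# Minimal sketch: build a word co-occurrence matrix for a toy corpus
# with a symmetric context window of size 1.
import numpy as np

corpus = ["i love deep learning", "i like nlp", "i love nlp"]
tokens = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(tokens)}

counts = np.zeros((len(tokens), len(tokens)), dtype=int)
window = 1
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[index[w], index[words[j]]] += 1

print(tokens)
print(counts)  # counts[i, j] = how often word j appears within the window of word i
```

Rows for words that appear in similar contexts (like “love” and “like” here) end up looking similar, which is the intuition the next paragraph builds on.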

With a larger dataset than just one sentence, you can imagine that this similarity will become clearer as ‘like’, ‘love’, and other synonyms begin to have similar word vectors, because they are used in similar contexts.

The basic idea behind word vector initialization techniques is that we want to store as much information as we can in this word vector while still keeping the dimensionality at a manageable scale (25 – 1000 dimensions is ideal).

This is definitely one of the more confusing equations to understand, so if you’re still having trouble visualizing what’s happening, you can go here and here for additional resources.

One Sentence Summary: Word2Vec seeks to find vector representations of different words by maximizing the log probability of context words given a center word and modifying the vectors through SGD.

(Optional: The authors of the paper then go into more detail about how negative sampling and subsampling of frequent words can be used to get more precise word vectors.)
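As a hedged, practical illustration (not from the paper itself), skip-gram training with negative sampling is available off the shelf; the snippet below assumes gensim 4.x and a toy corpus of tokenized sentences:

```python
# Sketch only: train skip-gram Word2Vec with negative sampling on a toy corpus.
# Assumes gensim >= 4.0 (where the dimensionality parameter is vector_size).
from gensim.models import Word2Vec

sentences = [["i", "love", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "love", "nlp"]]

model = Word2Vec(
    sentences,
    vector_size=25,   # dimensionality of the word vectors
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive pair
    min_count=1,
    epochs=50,
)

print(model.wv["love"][:5])                 # first few components of the vector for "love"
print(model.wv.similarity("love", "like"))  # cosine similarity between two word vectors
```

With real corpora you would of course use far more data and larger vectors; the point here is only to show where the skip-gram and negative-sampling knobs live.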

If you take a close look at the superscripts, you’ll see that there’s a weight matrix \(W^{hx}\) which we multiply with our input, and a recurrent weight matrix \(W^{hh}\) which is multiplied with the hidden state vector at the previous time step.
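Putting those pieces together, the vanilla RNN hidden state update in this notation is (with \(\sigma\) a nonlinearity such as tanh):

\[
h_t = \sigma\!\left(W^{hx} x_t + W^{hh} h_{t-1}\right)
\]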

Additionally, the order of inputs in this sequence can largely affect how the weight matrices and hidden state vectors change during training.

If the gradient factor at each time step is small (say 0.25), then by the 3rd or 4th module the gradient will have practically vanished (the chain rule multiplies these factors together), and thus the hidden states of the earlier time steps won’t get updated.
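As a quick sanity check on the arithmetic, four such factors already shrink the signal to well under one percent of its original size,

\[
(0.25)^4 \approx 0.0039,
\]

which is why updates effectively stop reaching the earliest time steps.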

(The open dot indicates a Hadamard product.) Now, if you take a closer look at the formulation, you’ll see that if the reset gate unit is close to 0, then that whole term becomes 0 as well, thus ignoring the information in \(h_{t-1}\) from the previous time steps.

The final formulation of \(h_t\) is a function of all 3 components: the reset gate, the update gate, and the memory container.

When \(z_t\) is close to 1, the new hidden state vector \(h_t\) is mostly dependent on the previous hidden state, and we ignore the current memory container because \((1 - z_t)\) goes to 0.

When \(z_t\) is close to 0, the new hidden state vector \(h_t\) is mostly dependent on the current memory container, and we ignore the previous hidden state.
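For reference, here is a sketch of the full GRU formulation these components come from, written to match the convention used in the text above (\(\odot\) denotes the Hadamard product; note that some papers swap the roles of \(z_t\) and \(1 - z_t\)):

\[
\begin{aligned}
z_t &= \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) && \text{(update gate)}\\
r_t &= \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\left(W x_t + r_t \odot U h_{t-1}\right) && \text{(memory container)}\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\]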

Since the middle sentence has absolutely no impact on the question at hand, the reset and update gates will allow the network to “forget” that sentence.

Since this can be thought of as an extension of the idea behind a GRU, I won’t go too far into the analysis, but for a more in-depth walkthrough of each gate and each piece of computation, check out Chris Olah’s amazingly well-written blog post.

Long-term dependencies refer to situations where two words or phrases occur at very different time steps, but the relationship between them is still critical to solving the end goal.

Recursive neural networks and CNNs for NLP are sometimes used in practice, but aren’t as prevalent as RNNs, which really are the backbone behind most deep learning NLP systems.

Since there are numerous different problem areas within NLP (from machine translation to question answering), there are a number of papers that we could look into, but here are 3 that I found to be particularly insightful.

The intuitive idea is that in order to accurately answer a question regarding a piece of text, you need to somehow store the initial information given to you.

If I were to ask you the question “What does RNN stand for?” (assuming you’ve read this post fully), you’d be able to give me an answer because the information you absorbed by reading the first part of this post was stored somewhere in your memory.

This is in part because the task of question answering relies so heavily upon being able to model or keep track of long-term dependencies, such as keeping track of the characters in a story or a timeline of events.

At first glance, RNNs and LSTMs could be used, but these typically aren’t able to remember or memorize inputs from the past (which in question answering is quite critical).

The next step would be taking the feature representation I(x) and allowing our memory m to be updated to reflect the new input x we’ve received.

The 3rd and 4th steps involve reading from memory, based on the question, to get a feature representation o, and then decoding it to output a final answer r.

We take the argmax of the scoring function to find the output representation that best supports the question (you can also take several of the highest scoring memory units; it doesn’t have to be limited to one).

The scoring function is one that computes the matrix product between different embeddings of the question and the chosen memory unit[s] (check paper for more details).
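To tie the four steps (I, G, O, R) together, here is a heavily simplified, hedged sketch of the inference path; embed(), write_memory(), score() and answer() are hypothetical stand-ins for illustration, not the paper's actual learned components:

```python
# Heavily simplified sketch of memory-network inference (the I, G, O, R steps).
# embed(), write_memory(), score() and answer() are hypothetical placeholders.
import numpy as np

EMBED_DIM = 16
memory = []  # the G step writes feature representations here

def embed(text):
    # Placeholder feature map I(x): hash words into a bag-of-features vector.
    vec = np.zeros(EMBED_DIM)
    for word in text.lower().split():
        vec[hash(word) % EMBED_DIM] += 1.0
    return vec

def write_memory(x):
    # G step: update memory with the feature representation of the new input.
    memory.append(embed(x))

def score(q_vec, m_vec):
    # Placeholder scoring function: a plain dot product stands in for the
    # learned embedding/matrix product described in the paper.
    return float(q_vec @ m_vec)

def answer(question):
    # O step: pick the best-supporting memory; the R step would decode it to words.
    q = embed(question)
    best = max(range(len(memory)), key=lambda i: score(q, memory[i]))
    return best  # index of the supporting memory

write_memory("Joe went to the kitchen")
write_memory("Joe picked up the milk")
print(answer("where is the milk"))  # -> index of the most relevant stored sentence
```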

This network is trained in a supervised manner where training data includes the original text, the question, supporting sentences, and the ground truth answer.

The motivation behind this non-linear arrangement lies in the notion that natural language exhibits the property that words in sequence become phrases.

This approach required large amounts of linguistic domain knowledge and ultimately its design proved to be too brittle and lacked generalization ability.

It turns out the more effective approach (the one NMT uses) is to translate a whole sentence at a time, thus allowing for a broader context and a more natural rearrangement of words.

From a high level, the encoder transforms the input sentence into a vector representation, the decoder produces the output representation, and the attention module tells the decoder what to focus on during decoding (this is where the idea of utilizing the whole context of the sentence comes in).

In my mind, some future goals in the field could be to improve customer service chatbots, perfect machine translation, and hopefully get question answering systems to obtain a deeper understanding of unstructured or lengthy pieces of text (like Wikipedia pages).

Understanding LSTM Networks

In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\).

Consider what happens if we unroll the loop: This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists.

In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame.

If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky.

Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them.

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers.

Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.

This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\).

A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
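Collected in one place, the gate equations just described are (in the standard notation, with \(\sigma\) the sigmoid function, \([h_{t-1}, x_t]\) the concatenation of the previous hidden state and the current input, and \(*\) elementwise multiplication):

\[
\begin{aligned}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
\]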

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next.

For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes.

For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs.

Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks.

It’s a multi-part series. As part of the tutorial we will implement a recurrent neural network based language model.

The applications of language models are two-fold: First, they allow us to score arbitrary sentences based on how likely they are to occur in the real world.

If you want to predict the next word in a sentence, you’d better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations.

By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. RNNs have shown great success in many NLP tasks. The formulas that govern the computation happening in an RNN are sketched below.
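A common vanilla-RNN formulation, sketched here with \(U\), \(V\), and \(W\) as the input-to-hidden, hidden-to-output, and recurrent weight matrices (hedged: this matches the notation typically used in this tutorial series):

\[
\begin{aligned}
s_t &= \tanh(U x_t + W s_{t-1}) \\
o_t &= \mathrm{softmax}(V s_t)
\end{aligned}
\]

Here \(s_t\) is the hidden state at step \(t\) (the network’s “memory”) and \(o_t\) is the predicted probability distribution over the next word.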

A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities.

In Language Modeling our input is typically a sequence of words (encoded as one-hot vectors for example), and our output is the sequence of predicted words.

A key difference is that our output only starts after we have seen the complete input, because the first word of our translated sentence may require information captured from the complete input sequence.

In speech recognition, given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.

Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also the previous time steps.
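A hedged sketch of what that parameter sharing means for the gradient of the recurrent weights \(W\) at a single time step \(t\) (this is the backpropagation-through-time view):

\[
\frac{\partial E_t}{\partial W} \;=\; \sum_{k=0}^{t} \frac{\partial E_t}{\partial o_t}\,\frac{\partial o_t}{\partial s_t}\,\frac{\partial s_t}{\partial s_k}\,\frac{\partial s_k}{\partial W},
\]

where \(\partial s_t / \partial s_k\) is itself a product of per-step Jacobians, which is exactly where vanishing (or exploding) gradients enter.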

The memory in LSTMs is held in units called cells, and you can think of them as black boxes that take the previous state and the current input as their inputs.

I hope you’ve gotten a basic understanding of what RNNs are and what they can do. In the next post we’ll implement a first version of our language model RNN using Python and Theano. Please leave questions in the comments!

The Unreasonable Effectiveness of Recurrent Neural Networks

Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?” By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs.

As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence it is also much more appealing for those of us who aspire to build more intelligent systems.

You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner.

In another example, a recurrent network generates images of digits by learning to sequentially add color to a canvas (Gregor et al.). The takeaway is that even if your data is not in the form of sequences, you can still formulate and train powerful models that learn to process it sequentially.

If you’re more comfortable with math notation, we can also write the hidden state update as \( h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t ) \), where tanh is applied elementwise.
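A minimal numpy sketch of that update (simplified, in the spirit of the snippet in the original post; the class and sizes here are illustrative):

```python
import numpy as np

class VanillaRNN:
    """Minimal vanilla RNN cell: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t)."""

    def __init__(self, input_size, hidden_size, output_size):
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.h = np.zeros(hidden_size)

    def step(self, x):
        # Update the hidden state from the previous state and the current input,
        # then compute output scores from the new hidden state.
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        return self.W_hy @ self.h

rnn = VanillaRNN(input_size=4, hidden_size=8, output_size=4)
x = np.eye(4)[0]   # one-hot encoding of the first symbol in the vocabulary
y = rnn.step(x)    # unnormalized scores for the next symbol
print(y.shape)     # (4,)
```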

We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs y you’d like to see in response to your input sequences x.

In the diagram from the original post, we see, for example, that in the first time step, when the RNN saw the character “h”, it assigned a confidence of 1.0 to the next letter being “h”, 2.2 to the letter “e”, -3.0 to “l”, and 4.1 to “o”.

Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers).

If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea.

And if you have to act the big company too.” Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces).

In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say: “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same” looks like we’ve reached an infinite loop about startups.

There’s also quite a lot of structured markdown that the model learns; for example, it sometimes creates headings, lists, etc. Sometimes the model snaps into a mode of generating random but valid XML, in which it completely makes up the timestamp, id, and so on.

We had to step in and fix a few issues manually, but then you get plausible-looking math; it’s quite astonishing. Sometimes the model tries to generate LaTeX diagrams, but clearly it hasn’t really figured them out.

This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma.

In particular, I took all the source and header files found in the Linux repo on Github, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB).

This is usually a very amusing part: the model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code. There are too many fun parts to cover; I could probably write an entire blog post on just this part.

Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :) We saw that the results at the end of training can be impressive, but how does any of this work?

At 300 iterations we see that the model starts to get an idea about quotes and periods; the words are now also separated with spaces, and the model starts to get the idea of periods at the end of a sentence.

Longer words have now been learned as well, until at last, by about iteration 2000, we start to get properly spelled words, quotations, names, and so on. The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words, first the short ones and eventually the longer ones.

In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character.

Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state).

Below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t). Of course, a lot of these conclusions are slightly hand-wavy, as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation.

We can see that in addition to a large portion of cells that do not do anything interpretable, about 5% of them turn out to have learned quite interesting and interpretable algorithms. Again, what is beautiful about this is that we didn’t have to hardcode at any point that, if you’re trying to predict the next character, it might, for example, be useful to keep track of whether or not you are currently inside or outside of a quote.

I’ve only started working with Torch/Lua over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their gitter to get things done), but once you get the hang of things it offers a lot of flexibility and speed.

Here’s a brief sketch of a few recent developments (definitely not complete list, and a lot of this work draws from research back to 1990s, see related work sections): In the domain of NLP/Speech, RNNs transcribe speech to text, perform machine translation, generate handwritten text, and of course, they have been used as powerful language models (Sutskever et al.) (Graves) (Mikolov et al.) (both on the level of characters and words).

For example, we’re seeing RNNs in frame-level video classification, image captioning (also including my own work and many others), video captioning and very recently visual question answering.

My personal favorite RNNs in Computer Vision paper is Recurrent Models of Visual Attention, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)).

I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.

One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way (I’ll provide pointers in a bit that make this more concrete).

This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens.

Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly).

Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!).

If you’d like to play with training RNNs I hear good things about keras or passage for Theano, the code released with this post for Torch, or this gist for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass.

Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is: Yes, the post was about RNN and how well it works, so clearly this works :).

Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano

To understand what this means, let’s look at how an LSTM calculates a hidden state (I’m using \(\circ\) to mean elementwise multiplication).
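A sketch of those calculations in one common notation (hedged: the post’s own notation may differ slightly), with \(x_t\) the input, \(s_t\) the hidden state, and \(c_t\) the internal memory:

\[
\begin{aligned}
i_t &= \sigma\!\left(U^{(i)} x_t + W^{(i)} s_{t-1}\right) \\
f_t &= \sigma\!\left(U^{(f)} x_t + W^{(f)} s_{t-1}\right) \\
o_t &= \sigma\!\left(U^{(o)} x_t + W^{(o)} s_{t-1}\right) \\
g_t &= \tanh\!\left(U^{(g)} x_t + W^{(g)} s_{t-1}\right) \\
c_t &= c_{t-1} \circ f_t + g_t \circ i_t \\
s_t &= \tanh(c_t) \circ o_t
\end{aligned}
\]

Here \(i_t\), \(f_t\) and \(o_t\) are the input, forget and output gates, and \(g_t\) is a candidate hidden state computed the same way a vanilla RNN would.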

With that in mind let’s try to get an intuition for how a LSTM unit computes the hidden state. Chris Olah has an excellent post that goes into details on this and to avoid duplicating his effort I will only give a brief explanation here.

If you fix the input gate to all 1’s, the forget gate to all 0’s (you always forget the previous memory) and the output gate to all 1’s (you expose the whole memory), you almost get a standard RNN.

A common one is creating peephole connections that allow the gates to depend not only on the previous hidden state, but also on the previous internal state, adding an additional term to the gate equations.

The basic idea of using a gating mechanism to learn long-term dependencies is the same as in an LSTM, but there are a few key differences. Now that you’ve seen two models to combat the vanishing gradient problem, you may be wondering: which one should you use?

In many tasks both architectures yield comparable performance and tuning hyperparameters like layer size is probably more important than picking the ideal architecture.

If you are somehow forced to calculate the gradients yourself, you probably want to modularize the different units and have your own version of auto-differentiation using the chain rule.

It turns out this isn’t such a great idea. If you set your learning rate low enough, SGD is guaranteed to make progress towards a good solution, but in practice that would take a very long time. There exist a number of commonly used variations on SGD, including the (Nesterov) Momentum Method, AdaGrad, AdaDelta and rmsprop.

The basic idea behind rmsprop is to adjust the learning rate per-parameter according to a (smoothed) sum of the squared previous gradients.

Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their squared gradients is larger), and rare features get a larger learning rate.

For each parameter we keep a cache variable, and during gradient descent we update the parameter and the cache as sketched below. The decay is typically set to 0.9 or 0.95 and the 1e-6 term is added to avoid division by 0.
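Here is a hedged numpy sketch of that update rule (not the post’s Theano code; the parameter names are illustrative):

```python
# Sketch of the rmsprop update described above, in plain numpy.
# `grad` is the gradient of the loss with respect to `param`.
import numpy as np

def rmsprop_update(param, grad, cache, learning_rate=0.001, decay=0.9, eps=1e-6):
    # Smooth the squared gradients into the per-parameter cache.
    cache = decay * cache + (1.0 - decay) * grad ** 2
    # Scale the step by the root of the cache: large recent gradients -> smaller step.
    param = param - learning_rate * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy usage: one parameter matrix and a fake gradient.
W = np.zeros((3, 3))
cache = np.zeros_like(W)
fake_grad = np.ones_like(W)
W, cache = rmsprop_update(W, fake_grad, cache)
print(W[0, 0])  # the parameter moved a small step against the gradient
```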

I didn’t use pre-trained word vectors in my experiments, but adding an embedding layer (the embedding matrix in our code) makes it easy to plug them in.

The ith column vector corresponds to the ith word in our vocabulary. By updating the matrix we are learning word vectors ourselves, but they are very specific to our task (and data set) and not as general as those you can download, which are trained on millions or billions of documents.
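A small hedged sketch of what “plugging them in” can look like: initialize the columns of the embedding matrix from a hypothetical dictionary of pretrained vectors, leaving the remaining words randomly initialized:

```python
# Sketch: initialize an embedding matrix from pretrained vectors.
# `pretrained` is a hypothetical dict mapping word -> numpy vector,
# e.g. loaded elsewhere from a GloVe or word2vec file.
import numpy as np

vocab = ["the", "cat", "sat", "UNKNOWN_TOKEN"]
embedding_dim = 50
pretrained = {}  # hypothetical: word -> np.ndarray of shape (embedding_dim,)

E = np.random.uniform(-0.05, 0.05, (embedding_dim, len(vocab)))
for i, word in enumerate(vocab):
    if word in pretrained:
        E[:, i] = pretrained[word]  # column i holds the vector for word i
```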

You’ll likely see diminishing returns after 2-3 layers and unless you have a huge amount of data (which we don’t) more layers are unlikely to make a big difference and may lead to overfitting.

Instead of learning from one sentence at a time, you want to group sentences of the same length (or even pad all sentences to have the same length) and then perform large matrix multiplications and sum up gradients for the whole batch. That’s because such large matrix multiplications are efficiently handled by a GPU.
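A minimal sketch of the grouping step (illustrative only; real pipelines usually also pad within a batch and shuffle between epochs):

```python
# Sketch: group token-id sequences by length so each batch packs into one matrix.
from collections import defaultdict

sentences = [[1, 5, 2], [4, 2, 9], [3, 1], [7, 8, 2, 2]]  # token-id sequences

batches = defaultdict(list)
for sent in sentences:
    batches[len(sent)].append(sent)

for length, group in sorted(batches.items()):
    print(length, group)  # every sentence in `group` has the same length
```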

Visualizations of Recurrent Neural Networks

Recurrent neural networks (RNNs) have been shown to be effective in modeling sequences.

They have wide applicability in many domains of text: chatbots, machine translation, language modeling, etc.

The input and output vectors are drawn from the data and they are either word / character embeddings or output labels.

The overall score of the sentence (0.839 in this case) is calculated by passing the last hidden vector through the output function.

[gist line 87] The score for each word can be calculated by passing all the hidden vectors through the output function.

The hidden vector values at the highest level should be smoother than the lower levels across time steps.

By taking the gradient of the loss function with respect to the RNN hidden layer, it is easy to tell which words are salient.

[gist line 102] The word “why” is usually at the start of the sentence and included alongside a question.

In politeness research, questions containing “why” have a high chance of becoming impolite, unlike “what”.

It adds a memory vector, c(t), that decides which neurons to pass through for a learned proportion, p(t).

The latter is called a gate and it modulates between old and new information by using another neural network layer.

[gist line 89 for weight matrices but I wasn’t sure how to get the gate outputs themselves] For the i(t) graph, we notice that the words “accommodation”, “discount” and “reservation” are unimportant.

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far followed by machine translation and fancy RNN models tackling MT. Key phrases: Language Models. RNN. Bi-directional...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other sequence tasks, and bidirectional and deep...

Lecture 10 | Recurrent Neural Networks

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language modeling and image captioning, and how...

RNN Example in Tensorflow - Deep Learning with Neural Networks 11

In this deep learning with TensorFlow tutorial, we cover how to implement a Recurrent Neural Network, with an LSTM (long short term memory) cell with the MNIST dataset.

Lecture 10: Neural Machine Translation and Models with Attention

Lecture 10 introduces translation, machine translation, and neural machine translation. Google's new NMT is highlighted followed by sequence models with attention as well as sequence model...

Lecture 16: Dynamic Neural Networks for Question Answering

Lecture 16 addresses the question "Can all NLP tasks be seen as question answering problems?" Key phrases: Coreference Resolution, Dynamic Memory Networks for Question Answering over Text...

How to Predict Stock Prices Easily - Intro to Deep Learning #7

We're going to predict the closing price of the S&P...

Lecture 12: End-to-End Models for Speech Processing

Lecture 12 looks at traditional speech recognition systems and motivation for end-to-end models. Also covered are Connectionist Temporal Classification (CTC) and Listen Attend and Spell (LAS),...

4. Echo State Networks

Video from Coursera - University of Toronto - Course: Neural Networks for Machine Learning.

RNN Symposium 2016: Jason Weston - New Tasks & Architectures for Language Understanding and Dialogue

NIPS 2016 Symposium on Recurrent Neural Networks and Other Machines that Learn Algorithms, Barcelona, 8 December 2016. Slides etc. are available on the website.