
Understanding Convolutional Neural Networks for NLP

Each entry corresponds to one pixel, 0 for black and 1 for white (typically it’s between 0 and 255 for grayscale images).

(To understand this intuitively, think about what happens in parts of the image that are smooth, where a pixel’s color equals that of its neighbors: the additions cancel and the resulting value is 0, or black. If there’s a sharp edge in intensity, a transition from white to black for example, you get a large difference and a resulting white value.)
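To make this concrete, here is a minimal NumPy sketch (not from the article) of a difference filter sliding over a single row of pixels:

```python
import numpy as np

# A row of pixels: white (1) on the left, black (0) on the right.
row = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# A simple difference filter; each output is row[i] - row[i+1].
diff_filter = np.array([-1.0, 1.0])

response = np.convolve(row, diff_filter, mode="valid")
print(response)  # [0. 0. 1. 0. 0.] -- zero on smooth regions, white at the edge
```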

This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results.

For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers.

Say you want to classify whether or not there is an elephant in an image. Because you are sliding your filters over the whole image you don’t really care where the elephant occurs. In practice, pooling also gives you invariance to translation, rotation and scaling, but more on that later.

Instead of image pixels, the inputs to most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character.

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words).

The height, or region size, may vary, but sliding windows over 2-5 words at a time are typical. Putting all of the above together, a Convolutional Neural Network for NLP may look like this (take a few minutes to understand this picture and how the dimensions are computed).
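In code, such a network might look roughly like the following PyTorch sketch. Note that this is a minimal illustration, not the exact architecture from the picture: the vocabulary size, embedding dimension, number of filters, and number of classes are all assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128,
                 num_filters=100, region_sizes=(2, 3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Each filter spans the full embedding dimension (a full row),
        # so the kernel is (region_size x embed_dim).
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(r, embed_dim))
            for r in region_sizes
        ])
        self.fc = nn.Linear(num_filters * len(region_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).unsqueeze(1)     # (batch, 1, seq_len, embed_dim)
        # Convolve, drop the collapsed width dimension, apply a nonlinearity.
        feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # Max-pool each feature map over time down to a single number.
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

logits = SentenceCNN()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```

Because every filter spans the full embedding dimension, convolution only slides vertically over the words, and max-pooling over time collapses each feature map to a single value per filter.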

Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what higher-level representations actually “mean”, is far less obvious than in the image case.

The simple Bag of Words model is an obvious oversimplification with incorrect assumptions, but it has nonetheless been the standard approach for years and has led to pretty good results.

A larger stride size leads to fewer applications of the filter and a smaller output size. The following figure from the Stanford cs231n course website shows stride sizes of 1 and 2 applied to a one-dimensional input. In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network, i.e. one that looks like a tree.
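Here is a small NumPy sketch of that one-dimensional example (the input and filter values are made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)   # 1-D input
w = np.array([1, 0, -1], dtype=float)              # filter of size 3

def conv1d(x, w, stride):
    # Apply the filter at every stride-th position with full overlap.
    positions = range(0, len(x) - len(w) + 1, stride)
    return np.array([np.dot(x[i:i + len(w)], w) for i in positions])

print(conv1d(x, w, stride=1))  # 5 output values
print(conv1d(x, w, stride=2))  # 3 output values: larger stride, smaller output
```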

The most common way to do pooling is to apply a max operation to the result of each filter. You don’t necessarily need to pool over the complete matrix; you could also pool over a window. For example, the following shows max pooling for a 2×2 window (in NLP we typically apply pooling over the complete output, yielding just a single number for each filter): Why pooling?

For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters, or the size of your input.

By performing the max operation you are keeping information about whether or not the feature appeared in the sentence, but you are losing information about where exactly it appeared.
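A short sketch of pooling over the complete output (max-over-time pooling), assuming NumPy and made-up shapes:

```python
import numpy as np

# Max-over-time pooling: one number per filter, regardless of input length.
num_filters = 1000
for seq_len in (7, 19, 42):                  # sentences of varying length
    feature_maps = np.random.randn(num_filters, seq_len)
    pooled = feature_maps.max(axis=1)        # max over the time dimension
    print(seq_len, pooled.shape)             # always (1000,)
```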

In NLP you could imagine having various channels as well: you could have separate channels for different word embeddings (word2vec and GloVe, for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
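As a sketch, two embedding channels could be stacked the same way RGB channels are stacked for images (the embedding values here are random placeholders):

```python
import numpy as np

seq_len, embed_dim = 20, 300
word2vec_matrix = np.random.randn(seq_len, embed_dim)  # channel 1: word2vec-style vectors
glove_matrix = np.random.randn(seq_len, embed_dim)     # channel 2: GloVe-style vectors

x = np.stack([word2vec_matrix, glove_matrix])          # (channels, seq_len, embed_dim)
print(x.shape)  # (2, 20, 300); each filter now convolves across both channels
```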

Convolutions and pooling operations lose information about the local order of words, so that sequence tagging as in PoS Tagging or Entity Extraction is a bit harder to fit into a pure CNN architecture (though not impossible, you can add positional features to the input).

Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

Building a CNN architecture means that there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation of the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and variance over multiple runs.

A few results that stand out are that max-pooling always beat average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn’t seem to make a big difference in the NLP tasks that were considered.
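For concreteness, that kind of search space might be written down like this (a hypothetical grid, not the exact values evaluated in [7]):

```python
# Hypothetical hyperparameter grid for a sentence-classification CNN.
search_space = {
    "input_representation": ["word2vec", "GloVe", "one-hot"],
    "region_sizes": [(3,), (3, 4, 5), (2, 3, 4, 5)],  # filter heights in words
    "num_filters": [100, 400, 600],
    "pooling": ["max", "average"],
    "activation": ["relu", "tanh"],
}
```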

In addition to the word vectors, the authors use the relative positions of words to the entities of interest as an input to the convolutional layer. This model assumes that the positions of the entities are given, and that each example input contains one relation.
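A sketch of what such position features might look like (the helper and indexing scheme are hypothetical, not the authors’ exact formulation):

```python
# Relative position of every token to the two entities of interest.
def relative_positions(seq_len, entity1_idx, entity2_idx):
    pos1 = [i - entity1_idx for i in range(seq_len)]
    pos2 = [i - entity2_idx for i in range(seq_len)]
    return pos1, pos2

# "The [movie] was directed by [Nolan] ." -> entities at indices 1 and 5
print(relative_positions(7, 1, 5))
# ([-1, 0, 1, 2, 3, 4, 5], [-5, -4, -3, -2, -1, 0, 1])
```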

[13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task – recommending potentially interesting documents to users based on clickstream data.

Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples).
