
Understanding Convolutional Neural Networks for NLP

Each entry corresponds to one pixel, 0 for black and 1 for white (typically it’s between 0 and 255 for grayscale images).

(To understand this one intuitively, think about what happens in parts of the image that are smooth, where a pixel color equals that of its neighbors: the additions cancel and the resulting value is 0, or black. If there’s a sharp edge in intensity, a transition from white to black for example, you get a large difference and a resulting white value.)
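As a quick illustration of this intuition (my own sketch, not from the original post), here is a tiny NumPy example: a difference filter slid over a row of pixel intensities gives values near zero in smooth regions and a large magnitude at the white-to-black edge.

    import numpy as np

    # A row of grayscale intensities: a smooth white region followed by a sharp edge to black.
    row = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

    # A simple difference filter: compares each pixel with its neighbor.
    kernel = np.array([1.0, -1.0])

    # 'valid' keeps only positions where the filter fully overlaps the input.
    response = np.convolve(row, kernel, mode="valid")
    print(response)  # [ 0.  0.  0. -1.  0.  0.  0.] -- zero where smooth, large magnitude at the edge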

This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results.

For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers.

Because you are sliding your filters over the whole image, you don’t really care where the elephant occurs. In practice, pooling also gives you invariance to translation, rotation and scaling, but more on that later.

Instead of image pixels, the inputs to most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character.

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words).

The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this (take a few minutes to try to understand the picture and how the dimensions are computed).
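To make the dimension arithmetic concrete, here is a small NumPy sketch (the sentence length, embedding size, and region size are made-up illustrative values): a filter spanning region_size full rows of the sentence matrix produces sentence_length - region_size + 1 output values.

    import numpy as np

    sentence_length, embedding_size = 7, 5              # e.g. a 7-word sentence, 5-dimensional embeddings
    sentence_matrix = np.random.randn(sentence_length, embedding_size)

    region_size = 3                                      # the filter covers 3 words at a time
    filt = np.random.randn(region_size, embedding_size)

    # Narrow ("valid") convolution over full rows: one output per window position.
    outputs = [np.sum(sentence_matrix[i:i + region_size] * filt)
               for i in range(sentence_length - region_size + 1)]
    print(len(outputs))  # 5 = 7 - 3 + 1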

Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what higher-level representations actually “mean”, is much less obvious than in the vision case.

The simple Bag of Words model is an obvious oversimplification with incorrect assumptions, but has nonetheless been the standard approach for years and led to pretty good results.

A larger stride size leads to fewer applications of the filter and a smaller output size. A figure on the Stanford cs231n course website shows stride sizes of 1 and 2 applied to a one-dimensional input. In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network, i.e. one that looks a bit like a tree.
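The same idea is easy to sketch in NumPy (my own example, with an arbitrary filter): a size-3 filter applied with stride 1 versus stride 2 produces 5 versus 3 outputs.

    import numpy as np

    x = np.array([1., 2., 3., 4., 5., 6., 7.])
    w = np.array([1., 0., -1.])                          # an arbitrary size-3 filter

    def conv1d(x, w, stride):
        # Slide the filter over x with the given stride, no padding.
        return np.array([np.dot(x[i:i + len(w)], w)
                         for i in range(0, len(x) - len(w) + 1, stride)])

    print(conv1d(x, w, stride=1))  # 5 outputs
    print(conv1d(x, w, stride=2))  # 3 outputs -- larger stride, smaller output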

A key aspect of Convolutional Neural Networks is the pooling layers, typically applied after the convolutional layers; pooling layers subsample their input. The most common way to do pooling is to apply a max operation to the result of each filter. You don’t necessarily need to pool over the complete matrix; you could also pool over a window. For example, max pooling for a 2×2 window takes the maximum of each 2×2 block (in NLP we typically apply pooling over the complete output, yielding just a single number for each filter). Why pooling?

For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters, or the size of your input.

By performing the max operation you are keeping information about whether or not the feature appeared in the sentence, but you are losing information about where exactly it appeared. You are losing global information about locality (where in a sentence something happens), but you are keeping the local information captured by your filters, like “not amazing” being very different from “amazing not”.
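A minimal sketch of max-over-time pooling as used in NLP (the numbers are illustrative): each filter’s output sequence collapses to a single value, so the output size depends only on the number of filters, not on the sentence length.

    import numpy as np

    # Outputs of 3 filters slid over a sentence (one row per filter, illustrative values).
    feature_maps = np.array([
        [0.1, 0.9, 0.2, 0.0],   # this filter fired strongly at position 1
        [0.0, 0.1, 0.0, 0.3],
        [0.7, 0.2, 0.1, 0.1],
    ])

    pooled = feature_maps.max(axis=1)   # max over time: one value per filter
    print(pooled)                       # [0.9 0.3 0.7] -- "whether" is kept, "where" is lost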

In NLP you could imagine having various channels as well: you could have separate channels for different word embeddings (word2vec and GloVe, for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
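For example, stacking two different embeddings of the same sentence as channels could look like this in NumPy (illustrative shapes; this mirrors how RGB channels work for images):

    import numpy as np

    sentence_length, embedding_size = 7, 5
    word2vec_matrix = np.random.randn(sentence_length, embedding_size)   # channel 1 (illustrative)
    glove_matrix = np.random.randn(sentence_length, embedding_size)      # channel 2 (illustrative)

    # Stack the two representations along a channel dimension.
    sentence_tensor = np.stack([word2vec_matrix, glove_matrix], axis=-1)
    print(sentence_tensor.shape)   # (7, 5, 2): words x embedding dimensions x channels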

Convolutions and pooling operations lose information about the local order of words, so that sequence tagging as in PoS Tagging or Entity Extraction is a bit harder to fit into a pure CNN architecture (though not impossible, you can add positional features to the input).

Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

Building a CNN architecture means that there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation of the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and variance over multiple runs.

A few results that stand out are that max-pooling always beat average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn’t seem to make a big difference in the NLP tasks that were considered.

In addition to the word vectors, the authors use the relative positions of words to the entities of interest as an input to the convolutional layer. This model assumes that the positions of the entities are given, and that each example input contains one relation.

[13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task – recommending potentially interesting documents to users, trained based on clickstream data.

Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples).


Implementing a CNN for Text Classification in TensorFlow

The model presented in the paper achieves good classification performance across a range of text classification tasks (like Sentiment Analysis) and has since become a standard baseline for new text classification architectures.

Also, the dataset doesn’t come with an official train/test split, so we simply use 10% of the data as a dev set.

I won’t go over the data pre-processing code in this post, but it is available on Github. The network we will build in this post looks roughly as follows: the first layer embeds words into low-dimensional vectors, and the next layer performs convolutions over the embedded word vectors using multiple filter sizes.

Next, we max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer.

Because this is an educational post, I decided to simplify the model from the original paper a little; it is relatively straightforward (a few dozen lines of code) to add the omitted extensions to the code here.

To allow various hyperparameter configurations we put our code into a TextCNN class, generating the model graph in the init function.

To instantiate the class we then pass the model’s hyperparameters as arguments. We start by defining the input data that we pass to our network: tf.placeholder creates a placeholder variable that we feed to the network when we execute it at train or test time.
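A sketch of what these placeholders look like, using the TensorFlow 1.x-style API the post is based on (the concrete numbers are illustrative assumptions, not values from the post):

    import tensorflow as tf   # TensorFlow 1.x-style API (tf.compat.v1 in TensorFlow 2)

    sequence_length, num_classes = 59, 2     # illustrative values

    # None as the batch dimension lets us feed batches of any size.
    input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
    input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
    dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")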

TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, width, height and channel.

The result of our embedding doesn’t contain the channel dimension, so we add it manually, leaving us with a layer of shape [None, sequence_length, embedding_size, 1].
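Continuing the sketch above, the embedding lookup and the manually added channel dimension might look like this (vocab_size is an illustrative assumption):

    vocab_size, embedding_size = 20000, 128   # illustrative values

    # Embedding matrix, looked up by word id, then expanded with a channel dimension.
    W_embed = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W_embed")
    embedded_chars = tf.nn.embedding_lookup(W_embed, input_x)       # [None, sequence_length, embedding_size]
    embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)    # [None, sequence_length, embedding_size, 1]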

Because each convolution produces tensors of different shapes we need to iterate through them, create a layer for each of them, and then merge the results into one big feature vector.

'VALID' padding means that we slide the filter over our sentence without padding the edges, performing a narrow convolution that gives us an output of shape [1, sequence_length - filter_size + 1, 1, 1].

Performing max-pooling over the output of a specific filter size leaves us with a tensor of shape [batch_size, 1, 1, num_filters].

Once we have all the pooled output tensors from each filter size we combine them into one long feature vector of shape [batch_size, num_filters_total].
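A condensed sketch of this convolution/max-pooling loop, continuing the snippets above (TF 1.x-style; the filter sizes and filter count match the defaults mentioned later in the post):

    filter_sizes, num_filters = [3, 4, 5], 128
    pooled_outputs = []
    for filter_size in filter_sizes:
        # One convolution + max-pool branch per filter size.
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
        conv = tf.nn.conv2d(embedded_chars_expanded, W,
                            strides=[1, 1, 1, 1], padding="VALID")   # narrow convolution
        h = tf.nn.relu(tf.nn.bias_add(conv, b))
        pooled = tf.nn.max_pool(h,
                                ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                strides=[1, 1, 1, 1], padding="VALID")
        pooled_outputs.append(pooled)                                # [batch, 1, 1, num_filters]

    num_filters_total = num_filters * len(filter_sizes)
    h_pool = tf.concat(pooled_outputs, 3)
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])        # [batch, num_filters_total]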

Using the feature vector from max-pooling (with dropout applied) we can generate predictions by doing a matrix multiplication and picking the class with the highest score.

Here, tf.nn.softmax_cross_entropy_with_logits is a convenience function that calculates the cross-entropy loss for each class, given our scores and the correct input labels.
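Continuing the sketch, the scores, predictions and mean cross-entropy loss could be written as follows (variable names are my own):

    # Dropout on the pooled feature vector, then an affine layer to get class scores.
    h_drop = tf.nn.dropout(h_pool_flat, dropout_keep_prob)
    W_out = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1))
    b_out = tf.Variable(tf.constant(0.1, shape=[num_classes]))
    scores = tf.nn.xw_plus_b(h_drop, W_out, b_out)                   # unnormalized logits
    predictions = tf.argmax(scores, 1)

    # Mean cross-entropy loss over the batch.
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=scores, labels=input_y)
    loss = tf.reduce_mean(losses)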

The allow_soft_placement setting allows TensorFlow to fall back on a device with a certain operation implemented when the preferred device doesn’t exist.

When we instantiate our TextCNN model, all the variables and operations defined will be placed into the default graph and session we’ve created above.
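A sketch of that setup (TF 1.x-style; the TextCNN constructor arguments shown are assumptions based on the hyperparameters discussed in the post):

    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(allow_soft_placement=True,   # fall back if the preferred device is unavailable
                                      log_device_placement=False)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            # Assumed constructor signature, mirroring the hyperparameters above.
            cnn = TextCNN(sequence_length=59, num_classes=2, vocab_size=20000,
                          embedding_size=128, filter_sizes=[3, 4, 5], num_filters=128)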

Checkpoints can be used to continue training at a later point, or to pick the best parameters setting using early stopping.

Let’s now define a function for a single training step, evaluating the model on a batch of data and updating the model parameters.

We write a similar function to evaluate the loss and accuracy on an arbitrary data set, such as a validation set or the whole training set.

We iterate over batches of our data, call the train_step function for each batch, and occasionally evaluate and checkpoint our model. Here, batch_iter is a helper function I wrote to batch the data, and tf.train.global_step is a convenience function that returns the value of global_step.
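Putting these pieces together, a simplified training step and loop might look like this (a sketch only: batch_iter is assumed here to yield (x_batch, y_batch) tuples, and dev_step, train_data, x_dev and y_dev are assumed helpers and arrays from the pre-processing step):

    global_step = tf.Variable(0, name="global_step", trainable=False)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(cnn.loss, global_step=global_step)

    def train_step(x_batch, y_batch):
        # One gradient update on a batch; dropout is active during training.
        feed_dict = {cnn.input_x: x_batch,
                     cnn.input_y: y_batch,
                     cnn.dropout_keep_prob: 0.5}
        _, step, loss = sess.run([train_op, global_step, cnn.loss], feed_dict)
        return step, loss

    sess.run(tf.global_variables_initializer())
    for x_batch, y_batch in batch_iter(train_data, batch_size=64, num_epochs=200):
        current_step, current_loss = train_step(x_batch, y_batch)
        if current_step % 100 == 0:
            dev_step(x_dev, y_dev)   # evaluate on the dev set with dropout disabled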

Our training script writes summaries to an output directory, and by pointing TensorBoard to that directory we can visualize the graph and the summaries we created.

Running the training procedure with default parameters (128-dimensional embeddings, filter sizes of 3, 4 and 5, dropout of 0.5 and 128 filters per filter size) results in the following loss and accuracy plots (blue is training data, red is 10% dev data).

Convolutional neural network

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.[1] They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.[2][3] Convolutional networks were inspired by biological processes[4] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.

A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable.

The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters.[8] For instance, regardless of image size, tiling regions of size 5 x 5, each with the same shared weights, requires only 25 learnable parameters.

In this way, it resolves the vanishing or exploding gradients problem in training traditional multi-layer neural networks with many layers by using backpropagation.

Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer.[9][10] For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer.[11] Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer.

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortexes contain neurons that individually respond to small regions of the visual field.

Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.

Their 1968 paper[12] identified two basic visual cell types in the brain: simple cells and complex cells. The neocognitron[13] was introduced in 1980.[11][14] The neocognitron does not require units located at multiple network positions to have the same trainable weights.

This idea appears in 1986 in the book version of the original backpropagation paper.[15]:Figure 14 Neocognitrons were developed in 1988 for temporal signals.[16] Their design was improved in 1998,[17] generalized in 2003[18] and simplified in the same year.[19] LeNet-5, a pioneering 7-level convolutional network by LeCun et al. (1998), was applied by several banks to recognize hand-written numbers on checks.

Similarly, a shift invariant neural network was proposed for image character recognition in 1988.[2][3] The architecture and training algorithm were modified in 1991[20] and applied for medical image processing[21] and automatic detection of breast cancer in mammograms.[22] A different convolution-based design was proposed in 1988 for the decomposition of one-dimensional electromyography signals via de-convolution.

This design was modified in 1989 to other de-convolution-based designs.[24][25] The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid[26] by lateral and feedback connections.

Following the 2005 paper that established the value of GPGPU for machine learning,[27] several publications described more efficient ways to train convolutional neural networks using GPUs.[28][29][30][31] In 2011, they were refined and implemented on a GPU, with impressive results.[9] In 2012, Ciresan et al. significantly improved on the best performance in the literature for multiple image databases, including the MNIST database, the NORB database, the HWDB1.0 dataset (Chinese characters), the CIFAR10 dataset (60,000 32x32 labeled RGB images),[11] and the ImageNet dataset.[32]

While traditional multilayer perceptron (MLP) models were successfully used for image recognition, due to the full connectivity between nodes they suffer from the curse of dimensionality, and thus do not scale well to higher-resolution images.

For example, in CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in a first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights.
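A quick back-of-the-envelope comparison makes this concrete (illustrative layer sizes, plain Python arithmetic):

    # Fully connected: every hidden unit gets its own weight for every input value.
    input_size = 32 * 32 * 3            # 3,072 inputs for a CIFAR-10 image
    hidden_units = 100
    fc_params = input_size * hidden_units
    print(fc_params)                    # 307,200 weights (plus biases)

    # Convolutional: a 5x5x3 filter is reused at every spatial position.
    num_filters = 100
    conv_params = 5 * 5 * 3 * num_filters
    print(conv_params)                  # 7,500 weights (plus biases), independent of image size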

Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together.

Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.

The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume.

During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter.

Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account.

Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.

Since all neurons in a single depth slice share the same parameters, then the forward pass in each depth slice of the CONV layer can be computed as a convolution of the neuron's weights with the input volume (hence the name: convolutional layer).

The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume.

The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network, and hence to also control overfitting.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.
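A small NumPy sketch of 2x2 max pooling with stride 2 (my own example): each 2x2 block collapses to a single value, so only a quarter of the activations survive.

    import numpy as np

    activations = np.arange(16, dtype=float).reshape(4, 4)    # a 4x4 feature map

    # 2x2 max pooling with stride 2: group into 2x2 blocks and take the max of each.
    pooled = activations.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(activations.size, "->", pooled.size)                 # 16 -> 4: 75% of activations discarded
    print(pooled)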

Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which works better in practice.[34] Due to the aggressive reduction in the size of the representation, the trend is towards using smaller filters[35] or discarding the pooling layer altogether.[36] Region of Interest pooling (also known as RoI pooling) is a variant of max pooling, in which output size is fixed and input rectangle is a parameter.[37] Pooling is an important component of convolutional neural networks for object detection based on Fast R-CNN[38] architecture.

Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

In stochastic pooling,[43] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region.

Since the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting.

Since these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new ones.

For example, input images could be asymmetrically cropped by a few percent to create new examples with the same label as the original.[45] One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur.

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth.

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node.

Humans can extrapolate: after seeing a new shape once they can recognize it from a different viewpoint.[47] Convolutional networks do not handle such viewpoint changes as naturally, and currently the common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc.

The pose relative to retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.[48] Thus, one way of representing something is to embed the coordinate frame within it.

The vectors of neuronal activity that represent pose ('pose vectors') allow spatial transformations modeled as linear operations that make it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints.

In 2012 an error rate of 0.23 percent on the MNIST database was reported.[11] Another paper on using CNN for image classification reported that the learning process was 'surprisingly fast'; in the same paper, the best published results as of 2011 were achieved in the MNIST database and the NORB database.[9] When applied to facial recognition, CNNs achieved a large decrease in error rate.[50] Another paper reported a 97.6 percent recognition rate on '5,600 still images of more than 10 subjects'.[4] CNNs were used to assess video quality in an objective way after manual training; the resulting system had a very low root mean square error.[51]

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object classification and detection, with millions of images and hundreds of object classes.

The winner GoogLeNet[53] (the foundation of DeepDream) increased the mean average precision of object detection to 0.439329, and reduced classification error to 0.06656, the best result to date.

That performance of convolutional neural networks on the ImageNet tests was close to that of humans.[54] The best algorithms still struggle with objects that are small or thin, such as a small ant on a stem of a flower or a person holding a quill in their hand.

For example, humans are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle this.

One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.[56][57] Another way is to fuse the features of two convolutional neural networks, one for the spatial and one for the temporal stream.[58][59] Unsupervised learning schemes for training spatio-temporal features have been introduced, based on Convolutional Gated Restricted Boltzmann Machines[60] and Independent Subspace Analysis.[61] CNNs have also been explored for natural language processing.

CNN models are effective for various NLP problems and achieved excellent results in semantic parsing,[62] search query retrieval,[63] sentence modeling,[64] classification,[65] prediction[66] and other traditional NLP tasks.[67] CNNs have been used in drug discovery.

In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based rational drug design.[68] The system trains directly on 3-dimensional representations of chemical interactions.

Similar to how image recognition networks learn to compose smaller, spatially proximate features into larger, complex structures,[69] AtomNet discovers chemical features, such as aromaticity, sp3 carbons and hydrogen bonding.

Subsequently, AtomNet was used to predict novel candidate biomolecules for multiple disease targets, most notably treatments for the Ebola virus[70] and multiple sclerosis.[71] CNNs have been used in the game of checkers.

The learning process did not use prior human professional games, but rather focused on a minimal set of information contained in the checkerboard: the location and type of pieces, and the piece differential.

Ultimately, the program (Blondie24) was tested on 165 games against players and ranked in the highest 0.4%.[72][73] It also earned a win against the program Chinook at its 'expert' level of play.[74] CNNs have been used in computer Go.

In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.[75] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player.

When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.[76]

A couple of CNNs for choosing moves to try ('policy network') and evaluating positions ('value network') driving MCTS were used by AlphaGo, the first to beat the best human player at the time.[77] For many applications, little training data is available.

Other deep reinforcement learning models preceded it.[80] Convolutional deep belief networks (CDBN) have structure very similar to convolutional neural networks and are trained similarly to deep belief networks.

A time delay neural network allows speech signals to be processed time-invariantly, analogous to the translation invariance offered by CNNs.[83] They were introduced in the early 1980s.