# AI News

Artificial neural networks imitate concepts that we use when we think of the human brain.

We'll use embeddings and recurrent neural networks for sentiment classification of movie reviews: we want to know whether they contain a positive or negative sentiment.

For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer '3' encodes the 3rd most frequent word in the data.

This allows for quick filtering operations such as: 'only consider the top 10,000 most common words, but eliminate the top 20 most common words'. As a convention, '0' does not stand for a specific word, but instead is used to encode any unknown word.
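The frequency-rank encoding and the filtering it enables can be sketched in plain Python. This is a minimal illustration, not the dataset's own loader: the helper names `build_word_index` and `encode`, the toy texts, and the exact treatment of the `num_words` and `skip_top` bounds are assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None, skip_top=0, unknown=0):
    """Encode a text as ranks; filtered or unseen words become `unknown`."""
    encoded = []
    for word in text.split():
        rank = word_index.get(word)
        keep = (
            rank is not None
            and rank > skip_top
            and (num_words is None or rank <= num_words)
        )
        encoded.append(rank if keep else unknown)
    return encoded

# Toy corpus (an assumption, not taken from the movie review data).
texts = ["the movie was good", "the movie was bad", "the plot was thin"]
word_index = build_word_index(texts)

# Keep only ranks 2..4: drop the single most common word and the rare tail.
print(encode("the movie was good", word_index, num_words=4, skip_top=1))
# prints [0, 3, 2, 0]
```

Because the index is ordered by frequency, both filters reduce to simple comparisons on the integer rank.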

worst mistake of my life br br i picked this movie up at target for 5 because i figured hey it's sandler i can get some cheap laughs i was wrong completely wrong mid way through the film all three of my friends were asleep and i was still suffering worst plot worst script worst movie i have ever seen i wanted to hit my head up against a wall for an hour then i'd stop and you know why because it felt damn good upon bashing my head in i stuck that damn movie in the &lt;UNK&gt;

and watched it burn and that felt better than anything else i've ever done it took american psycho army of darkness and kill bill just to get over that crap i hate you sandler for actually going through with this and ruining a whole day of my life

Note the special words like &lt;START&gt;

We cut reviews after 80 words and pad them if needed. Now that we have our data, let's discuss the concepts behind our model!

The embedding layer learns the relations between words, the recurrent layer learns what the document is about, and the dense layer translates that to sentiment.

With one-hot encoding, the vocabulary '$$\textsf{code - console - cry - cat - dog}$$' would be represented like this: The three text snippets '$$\textsf{code console}$$', '$$\textsf{cry cat}$$' and '$$\textsf{dog}$$' are represented by combining these word vectors: This representation has some problems.
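As a rough illustration of the one-hot scheme described above, the following sketch builds the word vectors for the five-word vocabulary and combines them per snippet. Combining by element-wise sum is an assumption here; the helper names are hypothetical.

```python
vocabulary = ["code", "console", "cry", "cat", "dog"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary position."""
    return [1 if word == entry else 0 for entry in vocabulary]

def encode_snippet(snippet):
    """Combine word vectors by element-wise sum (a bag-of-words vector)."""
    vector = [0] * len(vocabulary)
    for word in snippet.split():
        vector = [a + b for a, b in zip(vector, one_hot(word))]
    return vector

for snippet in ["code console", "cry cat", "dog"]:
    print(snippet, encode_snippet(snippet))
# code console [1, 1, 0, 0, 0]
# cry cat [0, 0, 1, 1, 0]
# dog [0, 0, 0, 0, 1]
```

Every word needs its own dimension, and no two word vectors are closer to each other than any other pair, which hints at the problems this representation runs into.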

Instead of learning from one-hot encoding, we first let the neural network embed words in a smaller, continuous vector space where similar words are close to each other.

Such an embedding for our vocabulary could look like this: We only need two dimensions for our words instead of five, '$$\mathsf{cat}$$' is close to '$$\mathsf{dog}$$', and '$$\mathsf{console}$$' is somewhere between '$$\mathsf{code}$$' and '$$\mathsf{cry}$$'.
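To make the idea of closeness concrete, here is a small sketch with hand-picked two-dimensional coordinates. The coordinates are hypothetical and chosen only to match the description above; a real embedding would be learned by the network.

```python
import math

# Hypothetical 2-D embedding, picked by hand for illustration only.
embedding = {
    "code": (0.1, 0.9),
    "console": (0.4, 0.8),
    "cry": (0.7, 0.7),
    "cat": (0.8, 0.2),
    "dog": (0.9, 0.1),
}

def nearest(word):
    """Return the other word with the smallest Euclidean distance."""
    others = [w for w in embedding if w != word]
    return min(others, key=lambda w: math.dist(embedding[word], embedding[w]))

print(nearest("cat"))  # prints dog
print(math.dist(embedding["cat"], embedding["dog"]))
print(math.dist(embedding["cat"], embedding["code"]))
```

Euclidean distance is used here for simplicity; cosine similarity is another common choice for comparing word vectors.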

If we were interested in understanding a document, as in the previous example, we could use the following architecture. The left side of the figure shows a shorthand of the neural network; the right side shows the unrolled version.

The first layer learns a good representation of words, the second learns to combine words in a single idea, and the final layer turns this idea into a classification.

## Deep Learning, NLP, and Representations

In the last few years, deep neural networks have dominated pattern recognition.

This post reviews some extremely remarkable results in applying deep neural networks to natural language processing (NLP).

In my personal opinion, word embeddings are one of the most exciting areas of research in deep learning at the moment, although they were originally introduced by Bengio, et al.

A word embedding $$W: \mathrm{words} \to \mathbb{R}^n$$ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions).

For example, we might find:

$$W(\text{“cat”}) = (0.2,~ -0.4,~ 0.7,~ \ldots)$$

$$W(\text{“mat”}) = (0.0,~ 0.6,~ -0.1,~ \ldots)$$

(Typically, the function is a lookup table, parameterized by a matrix, $$\theta$$, with a row for each word: $$W_\theta(w_n) = \theta_n$$.) $$W$$ is initialized to have random vectors for each word.
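The lookup-table view of $$W$$ can be sketched in a few lines. This is only an illustration of the idea that each word selects one row of $$\theta$$: the toy vocabulary, the four dimensions and the uniform initialisation range are assumptions.

```python
import random

vocabulary = ["cat", "mat", "sat", "on", "the"]
n_dimensions = 4  # toy size; the text suggests perhaps 200 to 500

rng = random.Random(0)
# theta has one row per word, initialised with random values.
theta = [
    [rng.uniform(-1.0, 1.0) for _ in range(n_dimensions)]
    for _ in vocabulary
]
word_to_row = {word: n for n, word in enumerate(vocabulary)}

def W(word):
    """Look up the embedding row for `word`: W_theta(w_n) = theta_n."""
    return theta[word_to_row[word]]

print(W("cat"))
```

Training would then adjust the rows of `theta` in place, so the lookup itself never changes.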

For example, one task we might train a network for is predicting whether a 5-gram (sequence of five words) is ‘valid.’ We can easily get lots of 5-grams from Wikipedia (eg.

The model we train will run each word in the 5-gram through $$W$$ to get a vector representing it and feed those into another ‘module’ called $$R$$ which tries to predict if the 5-gram is ‘valid’ or ‘broken.’ Then, we’d like:

$$R(W(\text{“cat”}),~ W(\text{“sat”}),~ W(\text{“on”}),~ W(\text{“the”}),~ W(\text{“mat”})) = 1$$

$$R(W(\text{“cat”}),~ W(\text{“sat”}),~ W(\text{“song”}),~ W(\text{“the”}),~ W(\text{“mat”})) = 0$$

In order to predict these values accurately, the network needs to learn good parameters for both $$W$$ and $$R$$.

In the remainder of this section we will talk about many word embedding results and won’t distinguish between different approaches. One thing we can do to get a feel for the word embedding space is to visualize it with t-SNE, a sophisticated technique for visualizing high-dimensional data.

“a few people sing well” $$\to$$ “a couple people sing well”), the validity of the sentence doesn’t change.

While, from a naive perspective, the input sentence has changed a lot, if $$W$$ maps synonyms (like “few” and “couple”) close together, from $$R$$’s perspective little changes.

It seems quite likely that there are lots of situations where it has seen a sentence like “the wall is blue” and knows that it is valid before it sees a sentence like “the wall is red”.

For example, there seems to be a constant male-female difference vector:

$$W(\text{“woman”}) - W(\text{“man”}) ~\simeq~ W(\text{“aunt”}) - W(\text{“uncle”})$$

$$W(\text{“woman”}) - W(\text{“man”}) ~\simeq~ W(\text{“queen”}) - W(\text{“king”})$$

This may not seem too surprising.
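The difference-vector relationship can be illustrated with toy vectors. The three-dimensional values below are hypothetical and designed by hand so that one coordinate acts as a gender direction; learned embeddings only satisfy the relationship approximately.

```python
import math

# Hypothetical vectors: the second coordinate plays the role of "gender".
W = {
    "man": (1.0, 0.0, 0.5), "woman": (1.0, 1.0, 0.5),
    "uncle": (2.0, 0.0, 0.5), "aunt": (2.0, 1.0, 0.5),
    "king": (3.0, 0.0, 1.0), "queen": (3.0, 1.0, 1.0),
}

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def analogy(a, b, c):
    """Return the word closest to W[b] - W[a] + W[c], excluding the inputs."""
    target = add(subtract(W[b], W[a]), W[c])
    candidates = [w for w in W if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(W[w], target))

print(subtract(W["woman"], W["man"]))   # the toy gender offset
print(subtract(W["queen"], W["king"]))  # the same offset, by construction
print(analogy("man", "woman", "king"))  # prints queen
```

Adding the offset to another word and searching for the nearest neighbour is the usual way such analogies are probed.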

You write, “she is the aunt” but “he is the uncle.” Similarly, “he is the King” but “she is the Queen.” If one sees “she is the uncle,” the most likely explanation is a grammatical error.

We learned the word embedding in order to do well on a simple task, but based on the nice properties we’ve observed in word embeddings, you may suspect that they could be generally useful in NLP tasks.

In fact, word representations like these are extremely important: The use of word representations… has become a key “secret sauce” for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling.

(2013)) This general tactic – learning a good representation on a task A and then using it on a task B – is one of the major tricks in the Deep Learning toolbox.

Instead of learning a way to represent one kind of data and using it to perform multiple kinds of tasks, we can learn a way to map multiple kinds of data into a single representation!

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.

Recently, deep learning has begun exploring models that embed images and words in a single representation. The basic idea is that one classifies images by outputting a vector in a word embedding.

For example, if the model wasn’t trained to classify cats – that is, to map them near the “cat” vector – what happens when we try to classify images of cats?

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al.

Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word.

Breasts and, less reliably, long hair, makeup and jewelry, are obvious indicators of being female. Even if you’ve never seen a king before, if the queen, determined to be such by the presence of a crown, suddenly has a beard, it’s pretty reasonable to give the male version. Shared embeddings are an extremely exciting area of research and drive at why the representation-focused perspective of deep learning is so compelling.

This approach, of building neural networks from smaller neural network “modules” that can be composed together, is not very widespread.

If one considers the phrase “the cat sat on the mat”, it can naturally be bracketed into segments: “((the cat) (sat (on (the mat))))”.
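The bracketing above is a binary tree, and a recursive model walks that tree, merging two representations into one at every node. The sketch below shows only the recursion: the word vectors are hypothetical, and the averaging `merge` function is a stand-in assumption for a learned merging module.

```python
# The bracketing "((the cat) (sat (on (the mat))))" as nested pairs.
tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))

# Hypothetical 2-D word vectors, for illustration only.
W = {
    "the": (0.0, 0.2), "cat": (0.9, 0.1), "sat": (0.3, 0.7),
    "on": (0.1, 0.5), "mat": (0.8, 0.3),
}

def merge(left, right):
    """Stand-in for a learned module: average two vectors into one."""
    return tuple((l + r) / 2 for l, r in zip(left, right))

def represent(node):
    """Recursively reduce a bracketed phrase to a single vector."""
    if isinstance(node, str):
        return W[node]
    left, right = node
    return merge(represent(left), represent(right))

def leaves(node):
    """Read the words back off the tree, left to right."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

print(leaves(tree))
print(represent(tree))
```

Because the same `merge` function is reused at every node, the model can handle sentences of any length and any bracketing.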

(2013c) uses a recursive neural network to predict sentence sentiment: One major goal has been to create a reversible sentence representation, a representation that one can reconstruct an actual sentence from, with roughly the same meaning.

This post reviews a lot of research results I find very exciting, but my main motivation is to set the stage for a future post exploring connections between deep learning, type theory and functional programming.

## Practical Neural Networks with Keras: Classifying Yelp Reviews

Keras is a high-level deep learning library that makes it easy to build Neural Networks in a few lines of Python.

We’ll use a subset of the Yelp Challenge Dataset, which contains over 4 million Yelp reviews, and we’ll train our classifier to discriminate between positive and negative reviews.

Then we’ll compare the Neural Network classifier to a Support Vector Machine (SVM) on the same dataset, and show that even though Neural Networks are breaking records in most machine learning benchmarks, the humbler SVM is still a great solution for many problems.

We won’t be covering any of the mathematics or theory behind the deep learning concepts presented, so you’ll be able to follow even without any background in machine learning.

You can use your own machine, or one from any other cloud provider that offers GPU-compute virtual private servers, but then you’ll need to install and configure the required software yourself. In our example, we’ll be using the AWS Deep Learning AMI, which has all of the required software pre-installed and ready to use.

The EC2 p2.xlarge instances that we’ll be using usually cost around $1 per hour, while the same machine using Spot Pricing is usually around $0.20 per hour (depending on current demand).

You’ll need to go through a fairly long sign-up process, and have a valid credit card, but once you have an account, launching machines in the cloud can be done in a few clicks.

You’ll need to follow the instructions at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html if you’re not used to working with SSH and key pairs. Specifically, it’s important to change the permissions on your new private key before you can use it by running the following command, substituting ‘my-key-pair.pem’ with the full path and name of your private key.

Large AWS machines present juicy targets to attackers who want to use them for botnets or other nefarious purposes, so by default they don’t allow any incoming traffic.

If you’re using Mac or Linux, you can SSH simply by opening a terminal and typing the following command, replacing the 11.111.111.111 with the public IP address that you copied above, and the /path/to/your/private-key.pem with the full path to your private key file.

If you see an error when trying to connect via SSH, you can run the command again and add a -vvv flag which will show you verbose connection information.

If you’re on Windows, you can’t use the ssh command by default, but you can work around this through any of the following options:

After connecting to the instance, we’ll need to run a couple of commands.

On the server (in the same window that you connected to the server with SSH), run the following commands:

    pip3 install keras --upgrade --user

Now open a tmux session by running:

    tmux

This creates a virtual session in the terminal.

Running a tmux session will allow us to resume our session after reconnecting to the server if necessary, leaving all our code running in the background.

You can type tmux a -t 0 (for “tmux attach session 0”) to attach the session again if you need to stop the server or view any of the output.

The data is compressed as a .tgz file, which you can transfer to the AWS instance by running the following command on your local machine (after you have downloaded the dataset from Yelp):

    scp -i ~/path/to/your/private/key.pem yelp_dataset_challenge_round9.tgz ubuntu@11.111.111.111:~

Once again, substitute with the path to your private key file and the public IP address of your VPS as appropriate.

The colon separates the SSH connection string of your instance from the path that you want to upload the file to, and the ~ represents the home directory of your VPS.

We’ll, therefore, take a sample of the Yelp reviews which contains the same number of positive (four- or five-star) and negative (one-, two-, or three-star) reviews.

Keras represents each word as a number, with the most common word in a given dataset being represented as 1, the second most common as 2, and so on.

If we have our data tokenized with the more common words having lower numbers, we can easily train on only the N most common words in our dataset, and adjust N as necessary (for larger datasets, we would want a larger N, as even comparatively rare words will appear often enough to be useful).

By looking at the other sequences, we can infer that 1 represents the word “is”, 4 represents “a”, and 2 represents “common”.

We can take a look at the tokenizer’s word_index, which stores the word-to-token mapping, to confirm this:

    print(tokenizer.word_index)

Rare words, such as “discombobulation”

You can see the last text is represented only by [1,2] even though it originally contained four words, because two of the words are not part of the top 5 words.

Keras has the pad_sequences function to do this, which will pad with leading zeros to make all the texts the same length as the longest one:

    padded_sequences = pad_sequences(sequences)
    print(padded_sequences)

The last text has now been transformed from [1, 2] to [0, 0, 1, 2] in order to make it as long as the longest text (the first one).
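The padding behaviour described above can be mimicked in plain Python to see what it does. This is a sketch of the idea rather than the Keras implementation: `pad_with_leading_zeros` is a hypothetical helper, and of the toy sequences only the last one, [1, 2], comes from the text.

```python
def pad_with_leading_zeros(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [[value] * (longest - len(seq)) + list(seq) for seq in sequences]

# Toy token sequences; the first is the longest, as in the text.
sequences = [[1, 3, 4, 2], [1, 5, 2], [1, 2]]
for row in pad_with_leading_zeros(sequences):
    print(row)
# [1, 3, 4, 2]
# [0, 1, 5, 2]
# [0, 0, 1, 2]
```

Padding at the front keeps the actual words next to the end of the sequence, which is where a recurrent layer produces its final output.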

The maths behind RNNs gets a bit hairy, and even more so when we add the concept of LSTMs, which allow the neural network to pay more attention to certain parts of a sequence, and to largely ignore words which aren’t as useful.

As this is where the actual learning takes place, you’ll need a significant amount of time to run this step (approximately two hours on the Amazon GPU machine).

This means it will take half the reviews, along with their labels, and try to find patterns in the tokens that represent positive or negative labels.

It will then try to predict the answers for the other half, without looking at the labels, and compare these predictions to the real labels, to see how good the patterns it learned are.
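The train-then-validate procedure described above can be sketched without any deep learning library. Everything here is a toy assumption: the token data, the vote-counting "model" standing in for the neural network, and the choice to hold out the second half of the data.

```python
from collections import defaultdict

def split_half(samples, labels):
    """Train on the first half, hold out the second half for validation."""
    middle = len(samples) // 2
    return (samples[:middle], labels[:middle]), (samples[middle:], labels[middle:])

def fit(samples, labels):
    """Toy learner: each token votes +1 for positive, -1 for negative."""
    votes = defaultdict(int)
    for sample, label in zip(samples, labels):
        for token in sample:
            votes[token] += 1 if label == 1 else -1
    return votes

def predict(votes, sample):
    """Predict positive (1) when the summed votes are above zero."""
    return 1 if sum(votes.get(token, 0) for token in sample) > 0 else 0

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy tokenized reviews with labels (1 = positive, 0 = negative).
samples = [[1, 2], [3, 4], [1, 4], [3, 2], [1, 5], [3, 6]]
labels = [1, 0, 1, 0, 1, 0]

(train_x, train_y), (val_x, val_y) = split_half(samples, labels)
votes = fit(train_x, train_y)  # the labels are seen only for the training half
val_predictions = [predict(votes, sample) for sample in val_x]
print(accuracy(val_predictions, val_y))  # prints 1.0
```

The validation labels are used only inside `accuracy`, after the predictions have been made, which is what makes the score an honest estimate.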

data (data that it didn’t look at during the training stage), we can verify that the patterns it is learning are actually useful to us, and not overly-specific to the data we trained it on.

After each epoch completes, you’ll also see the val_loss and val_acc numbers appear, which represent the loss and accuracy on the held-out validation data.

of two items) and learn whether this was positive or negative, instead of having to learn the longer and more difficult sequence of the six independent items [“I”, “loved”, “this”, “friendly”, “guest”, “house”].

As a comparison, we’ll build a Support Vector Machine classifier using scikit-learn, another high-level machine learning Python library that allows us to create complex models in a few lines of code.

Run the following in a new cell: Now, instead of converting each word to a single number and learning an Embedding layer, we use a term-frequency inverse document frequency (TF-IDF) vectorisation process.

Using this vectorisation scheme, we ignore the order of the words completely, and represent each review as a large sparse vector, with each entry representing a specific word and how often it appears in that review.

We normalize the counts by how often the word appears across all of the reviews, so rare words are given a higher importance than common ones (though we ignore all words that aren’t seen in at least three different reviews).
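A minimal TF-IDF sketch in plain Python shows how rare words end up with higher weights. The `tfidf` helper and toy reviews are hypothetical; the weighting uses the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 and skips the length normalisation a library implementation would typically add, so the numbers will not match scikit-learn's output.

```python
import math
from collections import Counter

def tfidf(documents, min_df=1):
    """Weight each word count by a smoothed inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Ignore words that occur in fewer than `min_df` documents.
    vocabulary = sorted(word for word, count in df.items() if count >= min_df)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sparse representation: store only the words that actually occur.
        vectors.append({w: counts[w] * idf[w] for w in vocabulary if counts[w]})
    return vectors

reviews = ["great food great service", "terrible food", "great view"]
vectors = tfidf(reviews)
print(vectors[1])  # "terrible" (one review) outweighs "food" (two reviews)
```

Storing each review as a dictionary of non-zero entries mirrors the sparse representation: most of the vocabulary never appears in any single review.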

We need about a minute and a half to vectorize the reviews, transforming each of 200 000 reviews into a vector containing 775 090 features.

Conversely, for smaller datasets, the SVM is much better than the neural network; try running all of the above code again for 2 000 reviews instead of 200 000 and you’ll see that the neural network really battles to find meaningful patterns if you limit the training examples.

For example, you could download thousands of news reports before an election and use our model to see whether mainly positive or mainly negative things are being said about key political figures, and how that sentiment is changing over time.

If we want to classify more texts, we must use the same tokenizers without refitting them (a new dataset will have a different most-common word, but our neural network learned specific things about the word representations from the Yelp dataset).
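The point about reusing a fitted mapping can be illustrated with the standard library alone. This sketch stands in for a real tokenizer: the `word_index` dictionary is hypothetical (only the numbers for “is”, “common” and “a” echo the earlier example), and `pickle` is used here simply as one way to persist a Python object.

```python
import pickle

# Hypothetical word index learned once, from the training data.
word_index = {"is": 1, "common": 2, "the": 3, "a": 4}

def encode(text, index):
    """Encode with a fixed index; unseen words are dropped, not re-ranked."""
    return [index[w] for w in text.lower().split() if w in index]

# Serialise the fitted mapping, then restore it later without refitting.
blob = pickle.dumps(word_index)
restored = pickle.loads(blob)

print(encode("this is a common phrase", restored))  # prints [1, 4, 2]
```

Because the restored index is identical to the one used in training, each number still refers to the word the model learned it for, whatever the word frequencies in the new texts are.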

For the Keras model, we can save the tokenizer and the trained model as follows (make sure that you have h5py installed with pip3 install h5py first if you’re not using the pre-configured AWS instance).

If we want to predict whether some new piece of text is positive or negative, we can load our model and get a prediction with: To package the SVM model, we similarly need to package both the vectoriser and the classifier.

## On the contribution of neural networks and word embeddings in Natural Language Processing

Neural networks have contributed to outstanding advancements in fields such as computer vision [1,2] and speech recognition [3].

In this post I will try to explain, in a very simplified way, how to apply neural networks and integrate word embeddings in text-based applications, and some of the main implicit benefits of using neural networks and word embeddings in NLP.

Word embeddings are (roughly) dense vector representations of wordforms in which similar words are expected to be close in the vector space.

The construction of these word embeddings varies, but in general a neural language model is trained on a large corpus and the output of the network is used to learn word vectors (e.g.

Depending on the specific model, the learning process is carried out in one way or another, but what needs to be learnt are the weights representing the connections (word embeddings are also generally updated throughout the learning process), which is generally achieved through backpropagation.

If you are interested in knowing more details, the “Neural Network Methods in Natural Language Processing” book by Yoav Goldberg provides an exhaustive overview of neural networks applied to NLP (a short version is also freely available online [7]).

For example, given the following sentence inside a text: In order to apply our linear model we should first split the text into words (i.e.

Nevertheless, this would require further feature engineering, and even with this integration, it is likely some negative words are left out, or that they are simply not enough given the richness of the human language.

There have been approaches attempting to deal with negation and the polarity of sentences at the phrase level [11], but these kinds of solution get increasingly specific and over-complicated.

Therefore, instead of adding complex features for dealing with all cases (practically impossible), neural architectures take the whole sorted sequence into account, and not each word in isolation.

These are common to most neural architectures modeling text corpora, but of course there are more in-depth details of each specific learning algorithm which need to be studied separately and may contribute to other aspects of the learning process.

Lecture 3 | GloVe: Global Vectors for Word Representation

Lecture 3 introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by ...

How to Make a Text Summarizer - Intro to Deep Learning #10

I'll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We'll go over word embeddings, ...

Review Session: Midterm Review

This midterm review session covers word vector representations, neural networks and RNNs. Also reviewed is backpropagation, gradient calculation and ...

From Deep Learning of Disentangled Representations to Higher-level Cognition

One of the main challenges for AI remains unsupervised learning, at which humans are much better than machines, and which we link to another challenge: ...

Lecture 10: Neural Machine Translation and Models with Attention

Lecture 10 introduces translation, machine translation, and neural machine translation. Google's new NMT is highlighted followed by sequence models with ...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other ...

Lecture 4: Word Window Classification and Neural Networks

Lecture 4 introduces single and multilayer neural networks, and how they can be used for classification purposes. Key phrases: Neural networks. Forward ...

CMU Neural Nets for NLP 2017 (3): Models of Words

This lecture (by Graham Neubig) for CMU CS 11-747, Neural Networks for NLP (Fall 2017) covers: * Describing a word by the company that it keeps * Counting ...

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far followed by machine translation and fancy RNN models tackling MT. Key phrases: ...

How Deep Neural Networks Work

A gentle introduction to the principles behind neural networks, including backpropagation. Rated G for general audiences. Follow me for announcements: ...