AI News

Natural language processing: What would Shakespeare say?

In this scene from the movie Interstellar, Cooper, a crew member of the Endurance spacecraft on its way to three distant planets via a wormhole, is conversing with TARS, one of several former US Marine Corps robots, at some point in the future.

Natural language has been an area of serious research for several decades, ever since Alan Turing in 1950 proposed a test in which a human evaluator, behind closed doors, would judge natural language conversations between another human and a machine designed to generate human-like responses.

The title of this post should really be ‘Natural Language Processing: What would Shakespeare say, and what would you say?’, because this post includes two interactive apps that can predict the next word. The first app, given a (Shakespearean) phrase, will predict the most likely word that Shakespeare would have said.

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

NLP encompasses many areas of computer science, besides inputs from the domains of linguistics, psychology, information theory, mathematics, and statistics. However, NLP is a difficult domain, as each language has its own quirks and ambiguities, and English is no different.

The empiricists approached natural language as a data-driven problem based on statistics, while the rationalist school, led by the linguist Noam Chomsky, strongly believed that sentence structure should be analyzed at a deeper level than mere surface statistics.

He cites two sentences: (a) ‘colorless green ideas sleep furiously’ and (b) ‘furiously sleep ideas green colorless’. Chomsky’s contention is that while neither sentence, nor any of their parts, has ever occurred in the past linguistic experience of English, it can be easily inferred that (a) is grammatical while (b) is not.

Thanks to great strides in processing power and the significant drop in hardware costs, the empiricist approach to Natural Language Processing made a comeback in the mid-1980s, with probabilistic language models driving the rise of the empiricists once again.

In this post I showcase two Shiny apps, written in R, that predict the next word given a phrase using statistical approaches belonging to the empiricist school of thought.

The first one tries to predict what Shakespeare would have said given a phrase (Shakespearean or otherwise), and the second is a regular app that predicts what we would say in our day-to-day conversation.

In order to build a language model, the program ingests a large corpus of documents. For the Shakespearean app, the corpus is the “Complete Works of Shakespeare”. This is also available as a free ebook from Project Gutenberg, but you will have to do some cleaning and tokenizing before using it.
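As a rough illustration of this first step, here is a minimal sketch in base R that reads a plain-text corpus, tokenizes it, and tabulates n-gram counts. The file name and the cleaning rules are assumptions made for the sketch, not the actual code behind the apps.

```r
# Minimal sketch: read a plain-text corpus (file name assumed), clean and
# tokenize it, then tabulate unigram, bigram and trigram counts.
corpus <- tolower(paste(readLines("shakespeare.txt", warn = FALSE), collapse = " "))
corpus <- gsub("[^a-z' ]", " ", corpus)          # keep letters and apostrophes only
tokens <- unlist(strsplit(corpus, "\\s+"))
tokens <- tokens[tokens != ""]

ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  rows  <- length(tokens) - n + 1
  # paste n shifted copies of the token vector together, element-wise
  grams <- do.call(paste, lapply(seq_len(n), function(k) tokens[k:(k + rows - 1)]))
  sort(table(grams), decreasing = TRUE)
}

unigrams <- ngram_counts(tokens, 1)
bigrams  <- ngram_counts(tokens, 2)
trigrams <- ngram_counts(tokens, 3)
head(bigrams)   # the most frequent word pairs in the corpus
```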

But the fact that we have not seen these combinations in the corpus should not mean that they could never occur. So the MLE for the bigrams, trigrams, etc. has to be smoothed so that it does not assign a zero conditional probability.

This is the simplest smoothing technique; also known as ‘add-1’ (Laplace) smoothing, it requires that 1 be added to all counts before the MLE is computed.
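With add-1 smoothing, the standard bigram estimate becomes (count(w_1 w_2) + 1) / (count(w_1) + V), where V is the vocabulary size. Below is a hedged sketch in R that applies this to the unigram and bigram tables built in the earlier sketch; the function name and the test phrases are illustrative only.

```r
# Add-1 (Laplace) smoothed bigram probability, reusing `unigrams` and
# `bigrams` from the counting sketch above.
vocab_size <- length(unigrams)

add1_bigram_prob <- function(w1, w2) {
  bigram_count  <- bigrams[paste(w1, w2)]
  bigram_count  <- ifelse(is.na(bigram_count), 0, bigram_count)
  history_count <- unigrams[w1]
  history_count <- ifelse(is.na(history_count), 0, history_count)
  # (C(w1 w2) + 1) / (C(w1) + V): every bigram, seen or unseen, gets some mass
  (bigram_count + 1) / (history_count + vocab_size)
}

add1_bigram_prob("to", "be")          # a bigram Shakespeare did use
add1_bigram_prob("to", "blockchain")  # unseen, but no longer zero probability
```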

If the longest n-gram match is not found, the search is backed off by looking up smoothed MLEs of trigrams for the phrase ‘my way’, and if that is not found either, by searching the bigrams for the next word after ‘way’.
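A hedged sketch of that backoff chain in R, reusing the count tables from the earlier sketch: try the longest available history first, then fall back to shorter ones. The helper name, the maximum n-gram order, and the final fallback to the most frequent unigram are assumptions, not the app's actual implementation.

```r
# Simple backoff next-word lookup over a list of n-gram count tables,
# where ngram_tables[[n]] holds the table of n-grams.
predict_next_word <- function(phrase, ngram_tables, max_n = 4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in seq(min(max_n, length(words) + 1), 2)) {
    history <- paste(tail(words, n - 1), collapse = " ")
    counts  <- ngram_tables[[n]]
    # candidate n-grams are those whose first n-1 words match the history
    hits <- counts[startsWith(names(counts), paste0(history, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(unlist(strsplit(best, " ")), 1))   # last word of the best n-gram
    }
  }
  names(ngram_tables[[1]])[1]   # nothing matched: back off to the top unigram
}

ngram_tables <- list(unigrams, bigrams, trigrams, ngram_counts(tokens, 4))
predict_next_word("all the world's a", ngram_tables)
```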

Smoothing: what if we see new n-grams?

This course covers a wide range of tasks in Natural Language Processing from basic to advanced: sentiment analysis, summarization, dialogue state tracking, to name a few.

The project will be based on the practical assignments of the course, which will give you hands-on experience with tasks such as text classification, named entity recognition, and duplicate detection.

To succeed in it, we expect familiarity with the basics of linear algebra and probability theory, the machine learning setup, and deep neural networks.

Language Model: A Survey of the State-of-the-Art Technology

Abstract: The goal of language modelling is to estimate the probability distribution of various linguistic units, e.g., words, sentences, etc.

Although it has been shown that continuous-space language models can obtain good performance, they suffer from some important drawbacks, including a very long training time and limitations on the number of context words.

The count-based methods, such as traditional statistical models, usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing.

The LM probability p(w_1, w_2, …, w_n) is a product of word probabilities based on a history of preceding words, whereby the history is limited to m words: p(w_1, w_2, …, w_n) = Π_(i=1..n) p(w_i | w_(i−m), …, w_(i−1)). This is also called a Markov chain, where the number of previous states (words here) is the order of the model.

The basic idea of an n-gram LM is that we can predict the probability of w_(n+1) from its preceding context: dividing the number of occurrences of the pair (w_n, w_(n+1)) by the number of occurrences of w_n gives what is called a bigram estimate.
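For a toy worked example with invented counts: if ‘the cat’ occurs 12 times in a corpus and ‘the’ occurs 300 times, the bigram estimate is P(cat | the) = 12 / 300 = 0.04.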

The estimation of a trigram word-prediction probability (most often used for LMs in practical NLP applications) is therefore straightforward, assuming maximum likelihood estimation: P(w_3 | w_1, w_2) = count(w_1 w_2 w_3) / count(w_1 w_2). However, when modeling the joint distribution of a sentence, a simple n-gram model would give zero probability to all of the combinations that were not encountered in the training corpus, i.e., to unseen n-grams.

Despite the smoothing techniques mentioned above, and the practical usability of n-gram LM, the curse of dimensionality (as for many other learning algorithms) is especially potent here, as there is a huge number of different combinations of values of the input variables that must be discriminated from each other.

For LM, this is the huge number of possible sequences of words, e.g., with a sequence of 10 words taken from a vocabulary of 100,000, there are 10⁵⁰ possible sequences.

For example, one would wish from a good LM that it can recognize a sequence like “the cat is walking in the bedroom” to be syntactically and semantically similar to “a dog was running in the room”, which cannot be provided by an n-gram model [4].

These reasons lead to the idea of applying deep learning and Neural Networks to the problem of LM, in hopes of automatically learning such syntactic and semantic features, and to overcome the curse of dimensionality by generating better generalizations with NNs.

Subsequent works have turned to focus on sub-word modelling and corpus-level modelling based on recurrent neural networks and their variant, the long short-term memory network (LSTM).

The first neural approach to LM is the neural probabilistic language model [4], which learns the parameters of the conditional probability distribution of the next word, given the previous n-1 words, using a feed-forward neural network of three layers.

In this model, each word in the vocabulary is associated with a distributed word feature vector, and the joint probability function of a word sequence is expressed as a function of the feature vectors of the words in the sequence.
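As a minimal sketch of that architecture, here is a feed-forward model in the spirit of [4], written with the keras R package; the vocabulary size, context length, and layer sizes are arbitrary assumptions, and this is not the survey's or the original paper's code.

```r
library(keras)

vocab_size  <- 10000   # assumed vocabulary size
context_len <- 4       # the n-1 preceding words fed to the model
embed_dim   <- 64      # dimensionality of the shared word feature vectors

# shared word feature vectors -> hidden tanh layer -> softmax over the vocabulary
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = embed_dim,
                  input_length = context_len) %>%
  layer_flatten() %>%                               # concatenate the context vectors
  layer_dense(units = 128, activation = "tanh") %>%
  layer_dense(units = vocab_size, activation = "softmax")

model %>% compile(loss = "sparse_categorical_crossentropy", optimizer = "adam")
# training input: a matrix of word indices (one row per context of 4 words);
# training target: the index of the word that follows each context
```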

Therefore, several other open questions for the future are addressed, mostly concerning speed-up techniques, more compact probability representations (trees), and introducing a priori knowledge (semantic information, etc.).

The basic idea in these papers is to cluster similar words before computing their probability, in order to only have to do one computation per word cluster at the output layer of the NN.

The authors first trained a model using a random tree over the corpus, then extracted the word representations from the trained model, and performed hierarchical clustering on the extracted representations.

Since all that matters for generating a probability at each node is the predicted feature vector, determined by the context, the probability of the current word can be expressed as a product of the probabilities of the binary decisions: P(w | context) = Π_i P(d_i | q_i), where d_i is the i-th digit of the encoding for word w, and q_i is the feature vector for the i-th node on the path to the corresponding word encoding.

The above probability definition can be extended to multiple encodings per word and a summation over all encodings, which allows better prediction of words with multiple senses in multiple contexts.
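As a small illustration of that product of binary decisions (with a single encoding per word), here is a sketch in R. The names are invented: r is the predicted feature vector, Q holds one node feature vector per row for the nodes on the word's path, and d is the word's binary code; modelling each decision with a sigmoid of a dot product is one common choice, not necessarily the exact form used in [6].

```r
# Probability of a word under a binary-tree (hierarchical) softmax.
# r: predicted feature vector from the context; Q: node feature vectors,
# one row per node on the word's path; d: the word's binary code (1 = left).
hierarchical_word_prob <- function(r, Q, d) {
  p_left <- 1 / (1 + exp(-as.vector(Q %*% r)))   # P(d_i = 1 | q_i) for each node
  prod(ifelse(d == 1, p_left, 1 - p_left))       # product over the path
}

# toy usage: a path of 3 nodes, 4-dimensional feature vectors
set.seed(1)
r <- rnorm(4)
Q <- matrix(rnorm(3 * 4), nrow = 3)
d <- c(1, 0, 1)
hierarchical_word_prob(r, Q, d)
```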

The best HLBL model reported in [6] reduces perplexity by 11.1% compared to a baseline Kneser-Ney smoothed 5-gram LM, at only 32 minutes training time per epoch.

After introducing hierarchical tree of words, the models can be trained and tested more quickly, and can outperform non-hierarchical neural models as well as the best n-gram model.

The recurrent neural network based language model (RNNLM) [7] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short-term memory.

At the same time, a gated word-character recurrent LM [10] is presented to address the same issue, namely that information about morphemes such as prefixes, roots, and suffixes is lost, and that rare words are a problem when using a word-level LM.

Bag-of-words model

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR).

In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier[3].

So, as we see in the bag algebra, the 'union' of two documents in the bags-of-words representation is, formally, the disjoint union, summing the multiplicities of each element.

The most common type of characteristic, or feature, calculated from the bag-of-words model is term frequency, namely the number of times a term appears in the text.

Similarly, the second entry corresponds to the word 'likes' which is the second word in the list, and its value is '2' because 'likes' appears in the first document 2 times.
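The example documents referred to above are not reproduced here; the sketch below assumes the two sentences commonly used for this illustration, so the specific documents and the resulting word list are an assumption.

```r
# Sketch of the term-frequency (bag-of-words) representation for two
# assumed example documents.
docs <- c("John likes to watch movies. Mary likes movies too.",
          "John also likes to watch football games.")

tokenize <- function(x) unlist(strsplit(tolower(gsub("[[:punct:]]", "", x)), "\\s+"))

token_lists <- lapply(docs, tokenize)
vocab <- unique(unlist(token_lists))   # the list of distinct words

# document-term matrix of raw counts: one row per document, one column per word
dtm <- t(sapply(token_lists, function(toks) table(factor(toks, levels = vocab))))
dtm
dtm[1, "likes"]   # 2: 'likes' occurs twice in the first document
```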

To address the problem that very frequent terms dominate these raw counts, one of the most popular ways to 'normalize' the term frequencies is to weight a term by the inverse of its document frequency, or tf–idf.
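A hedged sketch of that weighting, reusing the `dtm` count matrix from the sketch above and the common log-scaled formulation of inverse document frequency (one of several variants in use):

```r
# tf-idf: scale each raw term count by the inverse document frequency,
# idf(t) = log(N / df(t)), where N is the number of documents and df(t)
# the number of documents containing term t.
tf    <- dtm
df    <- colSums(dtm > 0)
idf   <- log(nrow(dtm) / df)
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
# terms that occur in every document (e.g. 'likes') get idf = 0 and drop out
```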

Applying this to the same example above, a bigram model will parse the text into two-word units (bigrams) and store the term frequency of each unit as before.
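A small sketch of that bigram parsing, reusing the tokenizer and example documents above; for simplicity it ignores sentence boundaries, which a more careful implementation would respect.

```r
# Parse a document into two-word units (bigrams) and count them.
bigrams_of <- function(text) {
  toks <- tokenize(text)
  paste(head(toks, -1), tail(toks, -1))
}
table(bigrams_of(docs[1]))
# e.g. "john likes", "likes to", "to watch", "watch movies", ...
```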

In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ('ham').

While any given word is likely to be found somewhere in both bags, the 'spam' bag will contain spam-related words such as 'stock', 'Viagra', and 'buy' much more frequently, while the 'ham' bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it is more likely to be.
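As a small, hedged sketch of that decision rule in R: the word counts for the two ‘bags’ and the test message are invented, the class priors are assumed equal, add-1 smoothing handles words with zero count in one of the bags, and out-of-vocabulary words are simply ignored; a real filter would be considerably more careful.

```r
# Toy Bayesian spam decision: compare the likelihood of the message's words
# under the 'spam' bag versus the 'ham' bag (invented counts, equal priors).
spam_counts <- c(stock = 60, viagra = 40, buy = 80, meeting = 5,  report = 5)
ham_counts  <- c(stock = 10, viagra = 1,  buy = 20, meeting = 90, report = 70)

log_likelihood <- function(words, counts) {
  probs <- (counts + 1) / (sum(counts) + length(counts))   # add-1 smoothing
  sum(log(probs[words]), na.rm = TRUE)                     # skip unknown words
}

message_words <- c("buy", "stock", "report")
if (log_likelihood(message_words, spam_counts) >
    log_likelihood(message_words, ham_counts)) "spam" else "ham"
```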

Natural language processing - n gram model - trigram example

n-gram predict demo

General Sequence Learning using Recurrent Neural Networks

indico's Head of Research, Alec Radford, led a workshop on general sequence learning using recurrent neural networks at Next.ML in San Francisco.

cs224n lecture02