Natural language processing: What would Shakespeare say?

In this scene, Cooper, a crew member of the Endurance spacecraft, which is on its way to three distant planets via a wormhole, is conversing with TARS, a former US Marine Corps robot, at some point in the future.

Natural language processing has been an area of serious research for several decades, ever since Alan Turing proposed, in 1950, a test in which a human evaluator would judge natural language conversations between another human and a machine designed to generate human-like responses, conducted behind closed doors.

The title of this post should really be ‘Natural Language Processing: What would Shakespeare say, and what would you say?’, because this post includes two interactive apps that can predict the next word: the first app, given a (Shakespearean) phrase, will predict the most likely word that Shakespeare would have said.

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

NLP encompasses many areas of computer science, besides drawing on linguistics, psychology, information theory, mathematics and statistics. However, NLP is a difficult domain, as each language has its own quirks and ambiguities, and English is no different.

The empiricists approached natural language as a data-driven problem based on statistics, while the rationalist school, led by the linguist Noam Chomsky, strongly believed that sentence structure should be analyzed at a deeper level than mere surface statistics.

He cites two sentences: (a) ‘colorless green ideas sleep furiously’ and (b) ‘furiously sleep ideas green colorless’. Chomsky’s contention is that while neither sentence, nor any of its parts, has ever occurred in the past linguistic experience of English, it can easily be inferred that (a) is grammatical while (b) is not.

Thanks to great strides in processing power and the significant drop in hardware costs, the empiricist approach to Natural Language Processing made a comeback in the mid-1980s, with probabilistic language models driving the rise of the empiricists again.

In this post I showcase two Shiny apps written in R that predict the next word given a phrase, using statistical approaches belonging to the empiricist school of thought.

The first one will try to predict what Shakespeare would have said given a phrase (Shakespearean or otherwise), and the second is a regular app that will predict what we would say in our regular day-to-day conversation.

In order to build a language model, the program ingests a large corpus of documents. For the Shakespearean app, the corpus is the “Complete Works of Shakespeare”. This is also available as a free ebook from Project Gutenberg, but you will have to do some cleaning and tokenizing before using it.

But the fact that we have not seen these combinations in the corpus should not mean that they could never occur. So the MLEs for the bigrams, trigrams etc. have to be smoothed so that they do not have a conditional probability of 0.

This is the simplest smoothing technique, also known as the ‘add +1’ smoothing technique, and requires that 1 be added to all counts. So the MLE for, say, a bigram becomes P( wi | wi-1 ) = [ count( wi-1, wi ) + 1 ] / [ count( wi-1 ) + V ], where V is the size of the vocabulary.

If this is not found, the app backs off and searches the smoothed MLEs of trigrams for the phrase ‘my way’; if that is not found either, it searches the bigrams for the most likely word to follow ‘way’.
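This back-off lookup can be sketched in a few lines of R. The sketch below is illustrative only: the quadgram, trigram and bigram tables and their column names (prefix, word, prob) are hypothetical stand-ins for whatever smoothed n-gram frequency tables the app actually builds.

```r
# Minimal back-off lookup sketch (hypothetical table and column names).
# Each table is a data.frame with a 'prefix' column (the preceding words),
# a 'word' column (the predicted word) and a smoothed probability 'prob'.
predict_next <- function(phrase, quadgrams, trigrams, bigrams) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  lookup <- function(tbl, prefix) {
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) hits$word[which.max(hits$prob)] else NA_character_
  }
  n <- length(tokens)
  # Try the longest context first, then back off to shorter contexts.
  if (n >= 3) {
    w <- lookup(quadgrams, paste(tail(tokens, 3), collapse = " "))
    if (!is.na(w)) return(w)
  }
  if (n >= 2) {
    w <- lookup(trigrams, paste(tail(tokens, 2), collapse = " "))
    if (!is.na(w)) return(w)
  }
  lookup(bigrams, tail(tokens, 1))
}
```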

Language model

A statistical language model is a probability distribution over sequences of words.

Given such a sequence, say of length m, it assigns a probability P(w1, …, wm) to the whole sequence.

Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications, especially ones that generate text as an output.

Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.

In speech recognition, the computer tries to match sounds with word sequences.

The language model provides context to distinguish between words and phrases that sound similar.

For example, in American English, the phrases 'recognize speech' and 'wreck a nice beach' are pronounced almost the same but mean very different things.

These ambiguities are easier to resolve when evidence from the language model is incorporated with the pronunciation model and the acoustic model.

Language models are used in information retrieval in the query likelihood model.

Here a separate language model is associated with each document in a collection.

Documents are ranked based on the probability of the query Q in the document's language model Md, i.e. P(Q | Md).

Commonly, the unigram language model is used for this purpose—otherwise known as the bag of words model.

Data sparsity is a major problem in building language models.

Most possible word sequences will not be observed in training.

One solution is to make the assumption that the probability of a word only depends on the previous n words.

This is known as an n-gram model or unigram model when n = 1.

A unigram model used in information retrieval can be treated as the combination of several one-state finite automata.[1] It splits the probabilities of different terms in a context, e.g. from

P(t1 t2 t3) = P(t1) P(t2 | t1) P(t3 | t1 t2)

to

Puni(t1 t2 t3) = P(t1) P(t2) P(t3)

In this model, the probability of each word only depends on that word's own probability in the document, so we only have one-state finite automata as units.

The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1.

As an illustration, a unigram model of a document assigns a probability to each term in the vocabulary. The probability generated for a specific query is then calculated as the product of the probabilities of the query terms, P(query) = P(t1) P(t2) … P(tn). For different documents, we can build their own unigram models, with different hitting probabilities of words in them.

And we use probabilities from different documents to generate different hitting probabilities for a query.

Then we can rank documents for a query according to the generating probabilities.
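A minimal R sketch of this query-likelihood ranking, assuming each document is just a character string and using unsmoothed unigram MLEs, so an unseen query term zeroes out a document's score:

```r
# Rank documents by unigram query likelihood: P(Q | Md) = product over query terms of P(t | Md).
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog chased the cat over the mat")
query <- "cat mat"

query_likelihood <- function(doc, query) {
  words <- unlist(strsplit(doc, "\\s+"))
  probs <- table(words) / length(words)          # unigram MLE for this document
  q <- unlist(strsplit(query, "\\s+"))
  prod(ifelse(q %in% names(probs), probs[q], 0)) # unseen term -> probability 0
}

scores <- sapply(docs, query_likelihood, query = query)
sort(scores, decreasing = TRUE)                  # higher score = better match
```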


In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0.

A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to create a smoothed document model.[2]

In an n-gram model, the probability P(w1, …, wm) of observing the sentence w1, …, wm is approximated as

P(w1, …, wm) ≈ ∏ i=1..m P(wi | wi-(n-1), …, wi-1)

Here, it is assumed that the probability of observing the ith word wi in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (nth order Markov property).

The conditional probability can be calculated from n-gram model frequency counts:

P(wi | wi-(n-1), …, wi-1) = count(wi-(n-1), …, wi-1, wi) / count(wi-(n-1), …, wi-1)

The terms bigram and trigram language model denote n-gram language models with n = 2 and n = 3, respectively.[3] Typically, however, the n-gram model probabilities are not derived directly from the frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not explicitly been seen before.

Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams.

Various methods are used, from simple 'add-one' smoothing (assign a count of 1 to unseen n-grams) to more sophisticated models, such as Good-Turing discounting or back-off models.

In a bigram (n = 2) language model, the probability of the sentence I saw the red house is approximated as

P(I, saw, the, red, house) ≈ P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)

whereas in a trigram (n = 3) language model, the approximation is

P(I, saw, the, red, house) ≈ P(I | <s>, <s>) P(saw | <s>, I) P(the | I, saw) P(red | saw, the) P(house | the, red) P(</s> | red, house)

Note that the context of the first n − 1 n-grams is filled with start-of-sentence markers, typically denoted <s>. Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
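As a concrete illustration, here is a small R sketch that estimates bigram probabilities from a toy corpus, pads sentences with <s> and </s> markers, and scores the example sentence with add-one smoothing; the toy corpus is made up for illustration.

```r
# Toy corpus; each element is one sentence.
corpus <- c("I saw the red house", "I saw the dog", "the red house is big")

# Pad with sentence markers and collect tokens and bigrams.
sents   <- lapply(strsplit(tolower(corpus), "\\s+"), function(w) c("<s>", w, "</s>"))
tokens  <- unlist(sents)
bigrams <- unlist(lapply(sents, function(w) paste(head(w, -1), tail(w, -1))))

uni_counts <- table(tokens)
bi_counts  <- table(bigrams)
V <- length(uni_counts)   # vocabulary size (including the markers)

# Add-one smoothed bigram probability P(w2 | w1).
p_bigram <- function(w1, w2) {
  bc <- bi_counts[paste(w1, w2)]
  bc <- ifelse(is.na(bc), 0, bc)
  uc <- uni_counts[w1]
  uc <- ifelse(is.na(uc), 0, uc)
  (bc + 1) / (uc + V)
}

# P(<s> i saw the red house </s>) under the bigram model.
sentence <- c("<s>", "i", "saw", "the", "red", "house", "</s>")
prod(mapply(p_bigram, head(sentence, -1), tail(sentence, -1)))
```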

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions.

The equation is

P(w1, …, wm) = ( 1 / Z(w1, …, wm-1) ) exp( a^T f(w1, …, wm) )

where Z(w1, …, wm-1) is the partition function, a is the parameter vector, and f(w1, …, wm) is the feature function.

In the simplest case, the feature function is just an indicator of the presence of a certain n-gram.

It is helpful to use a prior on a or some form of regularization.

The log-bilinear model is another example of an exponential language model.

Neural language models (or continuous space language models) use continuous representations or embeddings of words to make their predictions. These models make use of neural networks.

Continuous space embeddings help to alleviate the curse of dimensionality in language modeling: as language models are trained on larger and larger texts, the number of unique words (the vocabulary) increases, and the number of possible sequences of words increases exponentially with the size of the vocabulary, causing a data sparsity problem, because for each of the exponentially many sequences there is little or no training data. Thus statistics are needed to properly estimate probabilities.

Neural networks avoid this problem by representing words in a distributed way, as non-linear combinations of weights in a neural net.[4] An alternate description is that a neural net approximates the language function.

The neural net architecture might be feed-forward or recurrent, and while the former is simpler, the latter is more common.

Typically, neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution; i.e., the network is trained to predict a probability distribution over the vocabulary, given some linguistic context.

This is done using standard neural net training algorithms such as stochastic gradient descent with backpropagation.[4] The context might be a fixed-size window of previous words, so that the network predicts P(wt | wt-k, …, wt-1) from a feature vector representing the previous k words.[4] Another option is to use 'future' words as well as 'past' words as features, so that the estimated probability is P(wt | wt-k, …, wt-1, wt+1, …, wt+k).[5]

A third option, which allows faster training, is to invert the previous problem and make a neural network learn the context, given a word. One then maximizes the log-probability of the surrounding context words given the current word, Σ over -k ≤ j ≤ k, j ≠ 0 of log P(wt+j | wt).[6] This is called a skip-gram language model, and is the basis of the popular[7] word2vec program.
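To make the skip-gram setup concrete, here is a small, self-contained R sketch that generates the (center word, context word) training pairs such a model is trained on; it only builds the pairs, not the neural network itself, and the window size k = 2 is an arbitrary choice.

```r
# Generate (center, context) training pairs for a skip-gram model.
tokens <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
k <- 2  # context window size

pairs <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  ctx <- setdiff(max(1, i - k):min(length(tokens), i + k), i)
  data.frame(center = tokens[i], context = tokens[ctx], stringsAsFactors = FALSE)
}))

head(pairs)  # each row is one training example: predict 'context' given 'center'
```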

Instead of using neural net language models to produce actual probabilities, it is common to instead use the distributed representation encoded in the networks' 'hidden' layers as representations of words;

each word is then mapped onto an n-dimensional real vector called the word embedding, where n is the size of the layer just before the output layer.

The representations in skip-gram models have the distinct characteristic that they model semantic relations between words as linear combinations, capturing a form of compositionality.

For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

v(king) − v(male) + v(female) ≈ v(queen)

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[5][6]

A positional language model[8] is one that describes the probability of given words occurring close to one another in a text, not necessarily immediately adjacent.

Similarly, bag-of-concepts models[9] leverage the semantics associated with multi-word expressions such as buy_christmas_present, even when they are used in information-rich sentences like 'today I bought a lot of very nice Christmas presents'.

ttezel/gist:4138642

#A Collection of NLP notes

##N-grams

###Calculating unigram probabilities:

P( wi ) = count( wi ) / count( total number of words )

In english.. the probability of word i is the number of times we saw it, divided by the total number of words in the corpus.

###Calculating bigram probabilities:

P( wi | wi-1 ) = count( wi-1, wi ) / count( wi-1 )

In english.. the probability that we saw wordi-1 followed by wordi is the number of times we saw them in that order, divided by the number of times we saw wordi-1.

###Calculating trigram probabilities:

P( wi | wi-1, wi-2 ) = count( wi-2, wi-1, wi ) / count( wi-2, wi-1 )

In english.. probability that we saw wordi-2 followed by wordi-1 followed by wordi = [Num times we saw the three words in order] / [Num times we saw wordi-2 followed by wordi-1]

###Interpolation using N-grams

We can combine knowledge from each of our n-grams by using interpolation, weighting the unigram, bigram and trigram estimates with λ weights that sum to 1.
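A minimal R sketch of simple linear interpolation, assuming we already have functions p_uni, p_bi and p_tri (hypothetical names) that return the unigram, bigram and trigram MLE probabilities built from the counts above:

```r
# Linear interpolation of unigram, bigram and trigram estimates.
# p_uni, p_bi and p_tri are assumed probability functions built from counts;
# the lambdas are fixed here but would normally be tuned on held-out data.
interp_prob <- function(w, w1, w2, p_uni, p_bi, p_tri,
                        lambdas = c(0.1, 0.3, 0.6)) {
  stopifnot(abs(sum(lambdas) - 1) < 1e-8)
  lambdas[1] * p_uni(w) +
  lambdas[2] * p_bi(w, w1) +        # P( w | w1 )
  lambdas[3] * p_tri(w, w1, w2)     # P( w | w2, w1 )
}
```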

In Stupid Backoff, we use the trigram if we have enough data points to make it seem credible; otherwise, if we don't have enough of a trigram count, we back off and use the bigram, and if there still isn't enough of a bigram count, we use the unigram probability.
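Here is a minimal R sketch of Stupid Backoff scoring, assuming hypothetical count lookup functions c_tri, c_bi and c_uni over our n-gram tables and the commonly used 0.4 back-off weight; note that Stupid Backoff returns scores, not true probabilities.

```r
# Stupid Backoff score S( w | w2 w1 ): use the trigram if we have a count for it,
# otherwise back off to the bigram, then the unigram, multiplying by 0.4 each time.
# c_tri(a, b, c), c_bi(a, b) and c_uni(a) are assumed count functions;
# n_tokens is the total number of tokens in the corpus.
stupid_backoff <- function(w, w1, w2, c_tri, c_bi, c_uni, n_tokens, alpha = 0.4) {
  tri <- c_tri(w2, w1, w)
  if (tri > 0) return(tri / c_bi(w2, w1))
  bi <- c_bi(w1, w)
  if (bi > 0) return(alpha * bi / c_uni(w1))
  alpha^2 * c_uni(w) / n_tokens
}
```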

###Good-Turing Smoothing

The Good-Turing probability for things we have never seen is: P*( unseen ) = [Num things with frequency 1] / [Num things] = N1 / N

Modified Good-Turing count: count* = [ (count + 1) x Nc+1 ] / [ Nc ]

Assuming our corpus has the following frequency count: carp: 10, perch: 3, whitefish: 2, trout: 1, salmon: 1, eel: 1 (with catfish and bass unseen), we have N = 18, N1 = 3 and N2 = 1.

Calculating the probability of something we've never seen: P*( catfish ) = N1 / N = 3 / 18

Calculating the modified count of something we've seen: count*( trout ) = [ (1 + 1) x N2 ] / [ N1 ] = 2 x 1 / 3 = 2 / 3

Calculating the probability of something we've seen: P*( trout ) = count*( trout ) / count( all things ) = (2/3) / 18 = 1/27

What happens if we don't have a word that occurred exactly Nc+1 times?
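Before turning to that question, here is a minimal R sketch of the Good-Turing quantities above, computed from the simple named vector of species counts used in these notes:

```r
# Good-Turing counts from a toy frequency table.
counts <- c(carp = 10, perch = 3, whitefish = 2, trout = 1, salmon = 1, eel = 1)
N  <- sum(counts)                       # 18 tokens in total
Nc <- table(counts)                     # how many species occur c times

N1 <- as.numeric(Nc["1"])               # things seen exactly once: 3
p_unseen <- N1 / N                      # P*(catfish) = 3/18

# Modified count and probability for something seen once (e.g. trout).
N2 <- as.numeric(Nc["2"])
count_star_trout <- (1 + 1) * N2 / N1   # 2 * 1/3 = 2/3
p_trout <- count_star_trout / N         # (2/3) / 18 = 1/27
```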

A problem with Good-Turing smoothing is apparent in analyzing a sentence such as 'I can't see without my reading ___' to determine what word comes next: the word Francisco is more common than the word glasses, so we may end up choosing Francisco here, instead of the correct choice, glasses.

Kneser-Ney smoothing addresses this with a discounted bigram probability plus a continuation probability:

P( wi | wi-1 ) = [ max( count( wi-1, wi ) - d, 0) ] / [ count( wi-1 ) ] + Θ( wi-1 ) x Pcontinuation( wi )

Where: Pcontinuation( wi ) represents the continuation probability of wi (how likely wi is to appear as a novel continuation), and Θ( wi-1 ) is a normalizing constant, the discounted probability mass, which we calculate as follows:

Θ( wi-1 ) = { d x [ Num words that can follow wi-1 ] } / [ count( wi-1 ) ]

###Kneser-Ney Smoothing for N-grams

The Kneser-Ney probability we discussed above showed only the bigram case. The general recursive form for higher-order n-grams is:

Pkn( wi | wi-n+1 ... wi-1 ) = [ max( countkn( wi-n+1 ... wi ) - d, 0) ] / [ countkn( wi-n+1 ... wi-1 ) ] + Θ( wi-n+1 ... wi-1 ) x Pkn( wi | wi-n+2 ... wi-1 )
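A minimal R sketch of the bigram case, assuming a two-column character matrix of observed bigrams (one row per bigram token, columns w1 and w2), a fixed discount d = 0.75, and that the history word w1 was actually observed:

```r
# Kneser-Ney bigram probability from a two-column matrix of observed bigrams.
kn_bigram <- function(w, w1, bigrams, d = 0.75) {
  c_w1w <- sum(bigrams[, 1] == w1 & bigrams[, 2] == w)    # count(w1, w)
  c_w1  <- sum(bigrams[, 1] == w1)                        # count(w1); assumed > 0
  # Continuation probability: in how many distinct contexts does w appear,
  # relative to the number of distinct bigram types?
  p_cont <- length(unique(bigrams[bigrams[, 2] == w, 1])) / nrow(unique(bigrams))
  # Normalizing constant: discounted mass spread over the words that follow w1.
  theta  <- d * length(unique(bigrams[bigrams[, 1] == w1, 2])) / c_w1
  max(c_w1w - d, 0) / c_w1 + theta * p_cont
}
```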

The corrected word, w*, is the word in our vocabulary (V) that has the maximum probability of being the correct word (w), given the input x (the misspelled word).

The channel model P( x | w ) gives the probability that the (misspelled) input x was typed when the intended word was w. Suppose we have the misspelled word x = acress. We can generate our channel model candidates for acress as follows: actress => acress (deletion of 't'), and similarly for other candidate corrections within a small edit distance.

Given a sentence w1, w2, w3, ..., wn, generate a set of candidate words for each wi. Note that the candidate sets include the original word itself (since it may actually be correct!). Then we choose the sequence of candidates W that has the maximal probability.

In practice, we simplify by looking at the cases where only 1 word of the sentence was mistyped (note that above we were considering all possible cases where each word could have been mistyped).
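A minimal R sketch of the noisy-channel choice for a single misspelled word, assuming we already have a named vector of candidate channel probabilities P( x | w ) and a named vector of language-model probabilities P( w ); the numbers below are placeholders, not real estimates.

```r
# Noisy channel: pick w* = argmax over candidates of P(x | w) * P(w).
# Both vectors are hypothetical placeholders for illustration.
p_channel <- c(actress = 1.2e-4, cress = 1.4e-6, acres = 2.1e-5)   # P(x | w)
p_lm      <- c(actress = 2.3e-5, cress = 5.4e-7, acres = 3.2e-5)   # P(w)

scores <- p_channel * p_lm[names(p_channel)]
names(which.max(scores))   # the chosen correction w*
```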

It relies on a very simple representation of the document (called the bag of words representation). Imagine we have 2 classes (positive and negative), and our input is a text representing a review of a movie.

P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]

The class mapping for a given document is the class which has the maximum value of the above probability.

In the case of classes positive and negative, the prior P( c ) is the probability that any given review is positive or negative, without actually analyzing the current input document.

We use two assumptions to simplify the computation of this probability: the bag of words assumption (the position of a word in the document does not matter) and conditional independence (the word probabilities are independent given the class). It is important to note that both of these assumptions aren't actually correct - of course, the order of words matters, and they are not independent.

P( ci ) = [ Num documents that have been classified as ci ] / [ Num documents ]

In english.. the prior probability of class ci is the fraction of training documents that were labeled ci.

P( wi | cj ) = [ count( wi, cj ) ] / [ Σw∈V count( w, cj ) ]

In english...

The probability of word i given class j is the count that the word occurred in documents of class j, divided by the sum of the counts of each word in our vocabulary in class j.

If a word in the new document never occurred in the positive training documents, its P( wi | positive ) would be 0. This is BAD: since we are calculating the overall probability of the class by multiplying individual probabilities for each word, we would end up with an overall probability of 0 for the positive class.

To avoid this, we use add-1 (Laplace) smoothing:

P( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V ( count( w, cj ) + 1 ) ]

This can be simplified to

P( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V count( w, cj ) + |V| ]

or, for a word w and class c,

P( w | c ) = [ count( w, c ) + 1 ] / [ count( c ) + |V| ]

i.e. the count of how many times this word has appeared in class c, plus 1, divided by the total count of all words that have ever been mapped to class c, plus the vocabulary size. To classify a new document, we compute P( w | c ) for each word w in the new document, then multiply by P( c ), and the result is the probability that this document belongs to this class.
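A compact R sketch of this Naive Bayes computation with add-1 smoothing, using log probabilities to avoid underflow; the tiny training set is made up for illustration.

```r
# Naive Bayes sentiment classifier with add-1 (Laplace) smoothing.
train <- data.frame(
  text  = c("great fun great plot", "just plain boring", "powerful and fun",
            "no surprises and very few laughs"),
  class = c("pos", "neg", "pos", "neg"),
  stringsAsFactors = FALSE
)

tokenize <- function(x) unlist(strsplit(tolower(x), "\\s+"))
vocab <- unique(tokenize(paste(train$text, collapse = " ")))

classify <- function(doc) {
  scores <- sapply(unique(train$class), function(cls) {
    class_tokens <- tokenize(paste(train$text[train$class == cls], collapse = " "))
    prior  <- log(mean(train$class == cls))                # log P(c)
    loglik <- sum(sapply(tokenize(doc), function(w) {      # sum of log P(w | c)
      log((sum(class_tokens == w) + 1) / (length(class_tokens) + length(vocab)))
    }))
    prior + loglik
  })
  names(which.max(scores))
}

classify("fun and powerful plot")   # comes out "pos" on this toy data
```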

If we have a sentence that contains a title word, we can upweight the sentence (multiply all the words in it by 2 or 3 for example), or we can upweight the title word itself (multiply it by a constant).

These notes also distinguish several types of affective states:

Emotions => angry, sad, joyful, fearful, ashamed, proud, elated

Mood => diffuse, non-caused, low-intensity, long-duration change in subjective feeling

Interpersonal stances => friendly, flirtatious, distant, cold, warm, supportive, contemptuous

Attitudes => enduring, affectively colored beliefs, dispositions towards objects or persons

###Baseline Algorithm for Sentiment Analysis

Given a piece of text, we perform tokenization and classification.

####Tokenization Issues

Depending on what type of text we're dealing with, we can have different tokenization issues.

####Classification

We will have to deal with handling negation, e.g. distinguishing 'I didn't like this movie' from 'I really like this movie'.

Then, as we count the frequency that but has occurred between a pair of words versus the frequency with which and has occurred between the pair, we can start to build a ratio of buts to ands, and thus establish a degree of polarity for a given word.

PMI( word1, word2 ) = log2 { P( word1, word2 ) / [ P( word1 ) x P( word2 ) ] }

Then we can determine the polarity of the phrase as follows:

Polarity( phrase ) = PMI( phrase, excellent ) - PMI( phrase, poor )
= log2 { P( phrase, excellent ) / [ P( phrase ) x P( excellent ) ] } - log2 { P( phrase, poor ) / [ P( phrase ) x P( poor ) ] }

Start with a seed set of positive and negative words.
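A minimal R sketch of these PMI calculations; hits1(term) and hits2(term1, term2) are hypothetical co-occurrence count functions (in Turney's original formulation these came from search-engine hit counts) and N is the total count.

```r
# PMI-based polarity from hypothetical count functions hits1() and hits2().
pmi <- function(a, b, hits1, hits2, N) {
  p_ab <- hits2(a, b) / N
  p_a  <- hits1(a) / N
  p_b  <- hits1(b) / N
  log2(p_ab / (p_a * p_b))
}

polarity <- function(phrase, hits1, hits2, N) {
  pmi(phrase, "excellent", hits1, hits2, N) - pmi(phrase, "poor", hits1, hits2, N)
}
```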

##MaxEnt Classifiers (Maximum Entropy Classifiers)

We define a feature as an elementary piece of evidence that links aspects of what we observe ( d ) with a category ( c ) that we want to predict.

For example, one feature might pick out from the data cases where the class is DRUG and the current word ends with the letter c.

The MaxEnt probability of a class is then

P( c | d, λ ) = [ exp Σi λiƒi( c, d ) ] / [ ΣC exp Σi λiƒi( C, d ) ]

##Named Entity Recognition

Named Entity Recognition (NER) is the task of extracting entities (people, organizations, dates, etc.) from text.

n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech.

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n − 1)–order Markov model.[2] n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.

Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.

As an example, word-level 3-grams and 4-grams (and counts of the number of times they appeared) can be found in the Google n-gram corpus.[3] An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.

Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) can be described as following a categorical distribution (often imprecisely called a 'multinomial distribution').

For language identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as 'trigrams') that can be generated from 'good morning' are 'goo', 'ood', 'od ', 'd m', ' mo', 'mor' and so forth, counting the space character as a gram (sometimes the beginning and end of a text are modeled explicitly, adding '__g', '_go', 'ng_', and 'g__').

For sequences of words, the trigrams (shingles) that can be generated from 'the dog smelled like a skunk' are '# the dog', 'the dog smelled', 'dog smelled like', 'smelled like a', 'like a skunk' and 'a skunk #'.
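A small R helper that generates such n-grams, word-level shingles or character-level grams, from a string; the boundary markers '#' and '_' used in the examples above would be added by the caller if desired.

```r
# Generate word-level or character-level n-grams from a string.
ngrams <- function(x, n, type = c("word", "char")) {
  type  <- match.arg(type)
  units <- unlist(strsplit(x, if (type == "word") "\\s+" else ""))
  sep   <- if (type == "word") " " else ""
  if (length(units) < n) return(character(0))
  sapply(seq_len(length(units) - n + 1),
         function(i) paste(units[i:(i + n - 1)], collapse = sep))
}

ngrams("good morning", 3, "char")                   # "goo" "ood" "od " "d m" ...
ngrams("the dog smelled like a skunk", 3, "word")   # "the dog smelled" ...
```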

Practitioners more interested in multiple-word terms might preprocess strings to remove spaces. Many simply collapse whitespace to a single space while preserving paragraph marks, because the whitespace is frequently either an element of writing style or introduces layout or presentation not required by the prediction and deduction methodology.

For example, they have been used for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from.[5] They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.[6] n-gram models are often criticized because they lack any explicit representation of long range dependency.

This is because the only explicit dependency range is (n − 1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model).

Modern statistical models are typically made up of two parts, a prior distribution describing the inherent likelihood of a possible result and a likelihood function used to assess the compatibility of a possible result with observed data.

Conventional linguistic theory can be incorporated in these features (although in practice, it is rare that features specific to generative or other particular theories of grammar are incorporated, as computational linguists tend to be 'agnostic' towards individual theories of grammar).

The n-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.[7] Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. <unk>) into the vocabulary.

For example, both the strings 'abc' and 'bca' give rise to exactly the same 2-gram 'bc' (although {'ab', 'bc'} is clearly not the same as {'bc', 'ca'}).

For example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the 'background' vector).

It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.

The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem.

In the field of computational linguistics, in particular language modeling, skip-grams[9] are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.[10] They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.

For example, the set of 1-skip-2-grams of a text includes all the bigrams (2-grams) and, in addition, the subsequences formed by skipping one intermediate word.

Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text.[11][12][13] For example, the sentence 'economic news has little effect on financial markets' can be transformed to syntactic n-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.[11] Syntactic n-grams are intended to reflect syntactic structure more faithfully than linear n-grams, and have many of the same applications, especially as features in a Vector Space Model.
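As an illustration, a small R function that generates k-skip bigrams (ordered word pairs separated by at most k intervening words); with k = 1 this yields all ordinary bigrams plus the pairs that skip one word.

```r
# k-skip bigrams: ordered word pairs separated by at most k intervening words.
skip_bigrams <- function(x, k = 1) {
  w <- unlist(strsplit(x, "\\s+"))
  out <- character(0)
  for (i in seq_along(w)) {
    js <- (i + 1):(i + 1 + k)
    js <- js[js <= length(w)]
    if (length(js) > 0) out <- c(out, paste(w[i], w[js]))
  }
  unique(out)
}

skip_bigrams("the rain in Spain falls mainly", k = 1)
```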

Note: This post in no way tries to belittle the genius of Shakespeare. Our day-to-day conversation has approximately 210K, 181K and 65K unique bigrams, trigrams and quadgrams respectively.

jgendron/datasciencecoursera

This report provides documentation describing the process and decisions used to develop a predictive text model for the Data Science Capstone project.

The main body of the report provides the essential discussion of the product development story by summarizing key aspects of the analysis and model building.

The corpus used in this analysis might be considered to have a personality – specifically, that it is a unique collection of words, phrases, and sentences that has been characterized by some exploratory analysis to become familiar with what the corpus is and how it may be used for prediction.

Additional research, creative thinking, and persistent modeling alterations resulted in a predictive text model that balanced accuracy with scalability.

The problem is to analyze a large corpus of text in order to discover the structure and arrangement of words within the data using computational methods. The essence of this project is to take a corpus (a body) of text from various sources, clean and analyze that text data, and build a predictive model to present the next likely word in a stream of text provided by a user.

Data sciences are increasingly making use of Natural Language Processing combined with statistical methods developed within the arts and humanities decades ago to characterize and leverage the streams of data that are text based and not inherently quantitative.

An origin of natural language processing

Alan Turing (1950) opens his influential article 'Computing Machinery and Intelligence' with the statement, 'I propose to consider the question, "Can machines think?"'

He follows this by outlining something he calls the imitation game, played by man A - known as label X, woman B - known as label Y, and interrogator C.

Overall, this Turing Test has become a basis of natural language processing - covering a broad array of uses such as spelling correction, speech recognition, author identification, and prediction of words based on preceding words.

Literature review purpose and findings

At the outset of this project, course instructors provided us with various sources on natural language processing, text mining, and various R programming packages they felt would be useful.

The literature review had several primary goals; this section is not a comprehensive overview of the more than 40 sources reviewed, but merely a summary of the two works most influential in shaping the modeling approach taken in this project.

Noteworthy was their information about: reading corpora into the R environment, explaining functions to transform the data, explaining stemming, stopwords, and tagging parts of speech, considering the issue of text sparsity, and understanding the fundamentals of count-based analysis.

The authors present a key approach for building prediction models called the N-Gram, which relies on knowledge of word sequences from (N – 1) prior words.

We have to understand the data, determine what should be done with the data, and generate the questions that need to be asked to ascertain whether the data is sufficient to do the job. This section briefly addresses the acquisition, processing, and exploration of the data.

The data was downloaded using the R programming language (R Core Team, 2014) and the elements were extracted using the R text-mining package called tm (Feinerer).

This allowed a database to hold over 3.3 million documents in physical disk memory rather than completely in RAM to reserve processing capacity.

In general, those transformations included: converting to lower case, ensuring apostrophes were retained to maintain contractions, removing numbers, and removing excess whitespace.
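Those transformations map onto the tm package roughly as follows; this is a sketch under stated assumptions, not the report's actual code, and `texts` is a stand-in for the loaded documents.

```r
library(tm)

# Stand-in for the documents already read from disk.
texts <- c("First BLOG entry, number 100!",
           "Another   entry that   we don't want to  lose.")

corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))  # conversion to lower case
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, stripWhitespace)               # remove excess whitespace
# Punctuation removal that keeps apostrophes (to preserve contractions):
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^[:alnum:]' ]", " ", x)))
```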

Intermittently in the process, the corpus was written back to disk and the database was re-initialized using the filehash package to reduce the size of data processing in RAM.

This step is about understanding the relationships between the words and sentences, and other observable artifacts useful for setting expectations in the model development.

Highlights of the many hours of exploration included an understanding of the relationships between vocabulary size and unique words, distributions of various N-Grams, and information that helped reevaluate the original strategy after the literature review.

Diversity is defined as 'A measure of vocabulary diversity that is approximately independent of sample size is the number of different words divided by the square root of twice the number of words in the sample' (Richards, p.208).
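In R, that diversity measure is a one-liner over a vector of word tokens:

```r
# Vocabulary diversity: unique words divided by sqrt(2 * total words).
diversity <- function(tokens) length(unique(tokens)) / sqrt(2 * length(tokens))

tokens <- unlist(strsplit(tolower("the quick brown fox jumps over the lazy dog"), "\\s+"))
diversity(tokens)   # 8 unique words / sqrt(2 * 9) ~= 1.89
```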

This is in line with a common technique for separating data into a 60 percent training set, a 20 percent test set, and a 20 percent validation set.

Notice in Table 3 that the widely used measure of Type/Token Ratio shows the similarity in complexity as represented by the blogs and the entire corpora of blogs, news, and tweets.

Table 2: Effect of vocabulary size on diversity measures

Table 3: Effect of vocabulary size on Type/Token Ratios

Understanding the distribution among the word tokens helps shape expectations of the linguistic model.

Figure 2: Viewing frequencies of frequencies of various N-Grams

The frequencies of frequencies view of the corpus is an important feature for N-Grams because these counts enable predictions of other words.

Information from the literature indicates that these frequency tables respond well to regression analysis to smooth out the information and avoid having to keep all the data in storage.

The objective of the modeling phase is to create a model that balances accuracy with speed and scalability, given that we expect a very large data set and working corpus.

This proved to be a very efficient way to store the data because it collapsed entire matrices into a small number of columns (between three and five), albeit with a very large number of observations.

The basic flow of the user interface follows the algorithm: the algorithm allowed for an accumulation of possible third words within 10 clusters (fixed during prototyping).

The early predictive model - using a small corpus of 30,000 articles and documents - resulted in an accuracy of about 20 percent.

Revisions to the prototype data cleaning were implemented to include end-of-sentence and number tags, convert various ASCII codes to appropriate language word forms, and remove punctuation except apostrophes and <>.

Therefore, it was accomplished in a series of iterative chunks – taking a subset of 10,000 documents at a time, processing them, and then using the aggregate, merge, and rbind functions of R.

Those 54 data frames were then slowly condensed through the merge function to accumulate all of the counts, so as not to eliminate a trigram that appeared in one data frame when it could actually appear in others as well.
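A sketch of that chunked counting-and-merging pattern; the tiny `docs` vector, the chunk size and the `make_trigrams` helper are stand-ins for illustration, since the report's actual code is not shown.

```r
# Hypothetical stand-ins: in the report, docs held ~3.3 million documents
# and chunks were 10,000 documents at a time.
docs <- c("the cat sat on the mat", "the cat ran", "a dog sat on the mat")
make_trigrams <- function(x) {
  w <- unlist(strsplit(x, "\\s+"))
  if (length(w) < 3) return(character(0))
  sapply(seq_len(length(w) - 2), function(i) paste(w[i:(i + 2)], collapse = " "))
}

chunk_size <- 2
chunk_starts <- seq(1, length(docs), by = chunk_size)

# Count trigrams per chunk.
chunk_counts <- lapply(chunk_starts, function(s) {
  chunk <- docs[s:min(s + chunk_size - 1, length(docs))]
  trigrams <- unlist(lapply(chunk, make_trigrams))
  aggregate(count ~ trigram, data = data.frame(trigram = trigrams, count = 1), FUN = sum)
})

# Combine the chunk data frames and sum counts so no trigram is dropped.
total <- aggregate(count ~ trigram, data = do.call(rbind, chunk_counts), FUN = sum)
```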

Once all the data frames were consolidated - and consolidated - and consolidated again, the total number of single trigrams amounted to over 15,000,000 trigrams.

All told, these enhancements reduced the web-based model to reliance on a 42 MB CSV file that is scalable on all laptops, desktops, and mobile platforms tested.

Once loaded, the methods used to access the data frame allow the model to load in less than two minutes and subsequent queries of the model provide near instantaneous results to the user after they provide two words as input.

Using information learned earlier during exploratory data analysis, the algorithms were refined to generate look-up tables that balanced depth of information with speed of processing.

This analysis suggests that a predictive model can be built, but it is most useful for predicting common word stems as opposed to highly specialized language needs of a user.

Katz back-off: termed back-off N-Gram modeling, it was developed in 1987 by Katz; it predicts first based on non-zero, higher-order N-Grams and will 'back off' to a lower-order N-Gram if there is zero evidence of the higher-order N-Gram.

n-gram predict demo

Automatic Speech Recognition - An Overview

An overview of how Automatic Speech Recognition systems work and some of the challenges.