
Introduction to Natural Language Processing, Part 1: Lexical Units

In this series, we will explore core concepts related to the study and application of natural language processing.

Consider the process of extracting information from some data generating process: A company wants to predict user traffic on its website so it can provide enough compute resources (server hardware) to service demand.

Natural language processing is the application of the steps above — defining representations of information, parsing that information from the data generating process, and constructing, storing, and using data structures that store information — to information embedded in natural languages.

A digital newspaper may have an archive of online articles that can be used to build a search engine to allow users to find relevant content. Information that is representational of natural language can also be useful for building powerful applications, such as bots that respond to questions or software that translates from one language to another.

While a complete summary of natural language processing is well beyond the scope of this article, we will cover some concepts that are commonly used in general purpose natural language processing work.

In some applications, researchers capture these patterns with multiple complex regex queries and morphology-specific rules, and pass the text input through a finite state machine to determine the correct tokenization.
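The ordered-rule approach described above can be sketched in a few lines. The patterns below (abbreviations, numbers, words, punctuation) are illustrative assumptions rather than a production rule set; earlier rules take priority, much like the transitions of a simple finite state machine:

```python
import re

# Ordered tokenization rules: each pattern is tried in priority order.
TOKEN_RULES = [
    (r"[A-Za-z]\.(?:[A-Za-z]\.)+", "ABBREV"),   # e.g. "U.S."
    (r"\d+(?:\.\d+)?", "NUMBER"),               # integers and decimals
    (r"\w+(?:'\w+)?", "WORD"),                  # words, incl. contractions
    (r"[^\w\s]", "PUNCT"),                      # any single punctuation mark
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for pat, name in TOKEN_RULES))

def tokenize(text):
    """Return (token, rule_name) pairs for each match, skipping whitespace."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(text)]

print(tokenize("The U.S. GDP grew 2.3% in Q1, didn't it?"))
```

Because alternation in a single compiled pattern resolves ties by rule order, "U.S." is captured as one abbreviation token rather than being split at the periods.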

To adapt to a new corpus, tokenizers can be built by training statistical models on hand-tokenized text, though this approach is rarely used in practice due to the success of deterministic approaches.

To add some sophistication, instead of exhausting all n-grams, we could select the highest order n-gram representation of a set of terms subject to some condition, like whether it exists in a hard-coded dictionary (called a gazetteer) or whether it is common in our dataset.
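A minimal sketch of this longest-match-first strategy, assuming a tiny hand-made gazetteer and a hypothetical input sentence:

```python
# Prefer the highest-order n-gram covering a span when it appears in the
# gazetteer; otherwise fall back to unigrams.
GAZETTEER = {"new york", "machine learning"}

def chunk_by_gazetteer(tokens, max_n=3):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):           # try longest n-grams first
            gram = " ".join(tokens[i:i + n]).lower()
            if gram in GAZETTEER:
                out.append(gram)
                i += n
                break
        else:                                    # no multi-word match: unigram
            out.append(tokens[i].lower())
            i += 1
    return out

print(chunk_by_gazetteer("I study machine learning in New York".split()))
# -> ['i', 'study', 'machine learning', 'in', 'new york']
```

The same scaffold works if the membership test is swapped for a frequency threshold over the corpus instead of a fixed dictionary.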

Though smoothing can help ameliorate the problem, these language models tend to have trouble generalizing, and require some amount of transfer learning, feature engineering, determinism, or abstraction. Probabilistic n-gram models require labeled examples, machine learning algorithms, and feature extractors (the latter two are bundled in Stanford's NER software).

Lemmatisation traditionally requires a morphological parser, in which we completely featurize some unprocessed term (tense, plurality, part of speech, etc.) based on its morphological elements (prefix, suffix, etc.).

To build a parser, we create a dictionary of known stems and affixes (a lexicon) with metadata about them, like possible parts of speech; enumerate the rules (morphotactics) governing how morphemes can be compiled together (the plural modifier '-s' must follow the noun, for example); and finally enumerate rules (orthographic rules) that govern changes in a word under different morphological states (for instance, a past tense verb ending in '-c' must have a 'k' added, as in 'picnic' -> 'picnicked').

These rules and terms are passed to a finite state machine that passes over some input, maintaining a state or set of feature values that is updated as rules and lexicon entries are checked against the text (similar to how regular expressions work).
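A toy lemmatizer in the spirit of the parser described above: a small lexicon of stems, a few suffix-stripping (morphotactic) rules, and two orthographic rules undoing 'k'-insertion and consonant doubling. All lexicon entries and rules here are illustrative assumptions, not a real morphological grammar:

```python
LEXICON = {"picnic", "stop", "walk", "cat"}    # known stems
SUFFIXES = ["ed", "ing", "s"]                  # strippable affixes

def lemmatize(word):
    word = word.lower()
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if stem in LEXICON:                            # walked -> walk
            return stem
        if stem.endswith("ck") and stem[:-1] in LEXICON:
            return stem[:-1]                           # picnicked -> picnic
        if len(stem) > 1 and stem[-1] == stem[-2] and stem[:-1] in LEXICON:
            return stem[:-1]                           # stopped -> stop
    return word                                        # unknown: leave as-is

print(lemmatize("picnicked"))   # -> picnic
print(lemmatize("stopped"))     # -> stop
```

A production morphological parser (such as one built with finite-state toolkits) would also emit the features recovered along the way (tense, plurality, part of speech) rather than just the stem.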

Ultimately, the goal of building a preprocessing pipeline is to extract information that is relevant, sufficient, and useful for the task at hand. What counts as relevant, sufficient, and useful depends on the requirements of the project, the strength of the development team, and the availability of time and resources.

After translating raw text into a string or tokenized array of lexical units, the researcher or developer may take steps to preprocess the text data, such as string encoding, stop word and punctuation removal, spelling correction, part-of-speech tagging, chunking, sentence segmentation, and syntax parsing.
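Two of the steps listed above, punctuation and stop word removal, can be sketched with the standard library alone. The stop word list here is a tiny illustrative subset, not a standard one:

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess(text):
    text = text.lower()
    # Delete every punctuation character, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The cat, naturally, is in the hat."))
# -> ['cat', 'naturally', 'hat']
```

In practice, libraries such as NLTK ship curated stop word lists per language, and the right subset of steps varies by task, as the next section notes.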

The table below illustrates some examples of the types of processing a researcher may use given a task and some raw text. After preprocessing, we often need to take additional steps to represent the information in some text quantitatively.
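One common quantitative representation is a bag-of-words vector of term counts over a shared vocabulary. A minimal sketch over two hypothetical toy documents:

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# Shared vocabulary: every distinct token across the corpus, sorted for a
# stable column order.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def vectorize(doc):
    """Map a document to a count vector aligned with `vocab`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

print(vocab)                        # -> ['ate', 'cat', 'fish', 'sat', 'the']
print([vectorize(d) for d in docs])  # -> [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Richer representations (tf-idf weighting, embeddings) build on this same idea of mapping text into fixed-length numeric vectors.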
