Machine Learning: Full-Text Search in Javascript (Part 1: Relevance Scoring)

Full-text search, unlike most of the topics in this machine learning series, is a problem that most web developers have encountered at some point in their daily work.

This article (on TF-IDF, Okapi BM25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.

There are many, many ways to relate one text to another, but let's start simple and use a statistics-based approach: one that doesn't need to understand the language itself, but instead looks at the statistics of word usage, matching and weighting documents based on the prevalence of their distinctive words.

All it cares about is the simple fact that there are common words and there are rare words: if your search phrase includes both, you're better off ranking the documents that contain the rare word higher, and putting less weight on the matched common words.

You can also represent term frequency as the number of times a word appears divided by the total number of tokens (i.e., total words) in a document.

Term frequency says "I'm 100 words long and 'the' shows up 8 times, so the term frequency of 'the' is 8, or 8/100, or 8%."
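In code, that computation is a simple loop over the token array (a minimal sketch; the function name is just illustrative):

```javascript
// Term frequency: occurrences of `term` divided by total tokens.
// For a 100-token document where "the" appears 8 times: 8 / 100 = 0.08
function termFrequency(term, tokens) {
    var count = 0;
    for (var i = 0; i < tokens.length; i++) {
        if (tokens[i] === term) count++;
    }
    return count / tokens.length;
}
```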

The major improvements that Okapi BM25 brings over TF-IDF are two tunable parameters, called k1 and b, which modulate "term frequency saturation" and document-length normalization.

Lower values of k1 result in quicker saturation (meaning that the two documents mentioned above will end up with similar scores, because both already contain a significant number of occurrences of "baseball").
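For reference, the standard Okapi BM25 scoring function (not quoted in the excerpt above, but this is where k1 and b live) scores a document \(D\) against a query \(Q = q_1, \ldots, q_n\) as:

\[ \text{score}(D, Q) = \sum_{i=1}^{n} idf(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)} \]

where \(f(q_i, D)\) is the term frequency of \(q_i\) in \(D\), \(|D|\) is the document's length in tokens, and \(avgdl\) is the average document length across the corpus. k1 controls how quickly term frequency saturates; b controls how strongly document length normalizes the score.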

Let's dive into code: We define a simple Tokenize() static method whose purpose is to parse a string into an array of tokens.

Along the way, we lower-case all the tokens (to reduce entropy) and run the Porter stemmer algorithm, which further reduces the entropy of the corpus and improves matching (so that, for example, "walking" and "walked" both reduce to "walk" and match each other).
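A minimal sketch of such a tokenizer (the porterStem call is a placeholder for any Porter stemmer implementation; it isn't defined here):

```javascript
// Sketch of a tokenizer: lower-case, strip punctuation, split on
// whitespace, and stem each token. `porterStem` is assumed to be a
// Porter stemmer implementation provided elsewhere.
function tokenize(text) {
    return text
        .toLowerCase()        // reduce entropy: case-insensitive matching
        .replace(/\W+/g, ' ') // strip punctuation and other non-word chars
        .trim()
        .split(/\s+/)
        .filter(function (token) { return token.length > 0; })
        .map(porterStem);     // e.g. "walking", "walked" -> "walk"
}
```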

this.documents is our database of individual documents, but along with storing the full, original text of the document, we also store the document length and a list of all the tokens in the document along with their count and frequency.

Using this data structure we can easily and quickly (with a super-fast, O(1) hash table lookup) answer the question "in document #3, how many times did the word 'walk' occur?"
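A sketch of what that storage step might look like (the SearchIndex wrapper and field names are illustrative assumptions, not the article's exact code):

```javascript
// Illustrative document store: for each document keep the original
// text, its token count, and a hash of token -> { count, freq }, so
// "how many times did 'walk' occur in document #3?" is one O(1)
// lookup: index.documents[3].terms['walk'].count
function SearchIndex() {
    this.documents = {};
    this.terms = {}; // global term stats, used for idf later
}

SearchIndex.prototype.addDocument = function (doc) {
    var tokens = tokenize(doc.body);
    var entry = { id: doc.id, body: doc.body, length: tokens.length, terms: {} };

    tokens.forEach(function (token) {
        if (!entry.terms[token]) entry.terms[token] = { count: 0, freq: 0 };
        entry.terms[token].count++;
    });
    for (var token in entry.terms) {
        entry.terms[token].freq = entry.terms[token].count / entry.length;
        // track how many documents each term appears in (for idf)
        if (!this.terms[token]) this.terms[token] = { n: 0, idf: 0 };
        this.terms[token].n++;
    }

    this.documents[doc.id] = entry;
};
```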

Here's the method: it's a simple function, but since it loops over every term in the corpus, updating each one, it's a somewhat expensive operation.

The implementation is the standard formula for inverse document frequency (which you can easily find on Wikipedia): the log ratio of total documents to the number of documents a term appears in.
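A sketch of that update pass, assuming the index keeps a global terms hash where n is the number of documents each term appears in (as in the addDocument sketch above):

```javascript
// Standard idf: log of (total documents / documents containing term).
// Loops over every corpus term, which is why it's relatively expensive.
SearchIndex.prototype.updateIdf = function () {
    var totalDocuments = Object.keys(this.documents).length;
    for (var term in this.terms) {
        this.terms[term].idf = Math.log(totalDocuments / this.terms[term].n);
    }
};
```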

The idf score for each query term is globally pre-calculated and is just a simple look-up; the term frequency is document-specific but was also pre-calculated; the rest of the work is simply multiplication and division!

We add a temporary variable called _score to each document, and then sort the results by the score (descending) and return the top 10.
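Putting those pieces together, a simplified tf–idf version of that search loop might look like this (the article's real scorer plugs the same pre-calculated lookups into the BM25 formula instead; names here follow the earlier sketches):

```javascript
// Score every document by summing idf * tf over the query terms,
// attach a temporary _score, sort descending, return the top 10.
SearchIndex.prototype.search = function (query) {
    var queryTerms = tokenize(query);
    var results = [];

    for (var id in this.documents) {
        var doc = this.documents[id];
        doc._score = 0;
        for (var i = 0; i < queryTerms.length; i++) {
            var term = queryTerms[i];
            if (doc.terms[term] && this.terms[term]) {
                doc._score += this.terms[term].idf * doc.terms[term].freq;
            }
        }
        if (doc._score > 0) results.push(doc);
    }

    results.sort(function (a, b) { return b._score - a._score; });
    return results.slice(0, 10);
};
```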

tf–idf

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1]

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

Suppose, for example, that we search a collection for the query "the brown cow". Because the term 'the' is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word 'the' more frequently, without giving enough weight to the more meaningful terms 'brown' and 'cow'.

Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
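As a concrete (made-up) illustration: in a corpus of 10,000,000 documents, a term that appears in 1,000 of them gets

\[ idf = \log_{10}\left(\frac{10{,}000{,}000}{1{,}000}\right) = \log_{10}(10{,}000) = 4 \]

while a term that appears in every document gets \(idf = \log_{10}(1) = 0\) and contributes nothing to the score.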

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.

Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0.

Although it has worked well as a heuristic, its theoretical foundations have been troublesome for at least three decades after its introduction, with many researchers trying to find information-theoretic justifications for it.[7]

However, applying such information-theoretic notions to problems in information retrieval leads to difficulties when trying to define the appropriate event spaces for the required probability distributions: not only documents, but also queries and terms, need to be taken into account.[7]

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Tf–idf can be used successfully for stop-word filtering in various subject fields, including text summarization and classification.

The first of the two factors is the term frequency (TF) described above; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
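In code, that simplest ranking function might look like the following sketch (reusing the hypothetical index shape from the JavaScript snippets above):

```javascript
// Simplest ranking: a document's score for a query is the sum of
// tf * idf over the query terms. `doc.terms` and `index.terms` follow
// the illustrative structures used in the earlier sketches.
function tfIdfScore(queryTerms, doc, index) {
    return queryTerms.reduce(function (score, term) {
        var tf  = doc.terms[term]   ? doc.terms[term].freq  : 0;
        var idf = index.terms[term] ? index.terms[term].idf : 0;
        return score + tf * idf;
    }, 0);
}
```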

Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles

A central question in text mining and natural language processing is how to quantify what a document is about.

We might take the approach of adding very common words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.

The inverse document frequency for any given term is defined as

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

We can use tidy data principles, as described in the main vignette, to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.

Let’s look at the distribution of n/total for each novel, the number of times a word appears in a novel divided by the total number of terms (words) in that novel.

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen’s novels as a whole.

Counting Word Frequency using a Dictionary (Chapter 9)

Python for Everybody: Exploring Data in Python 3.0. Please visit the web site to access a free textbook, free supporting materials, as well …

Coding Challenge #40.3: TF-IDF

In part 3 of the Word Counting Coding Challenge, I implement an algorithm known as TF-IDF (Term Frequency – Inverse Document Frequency). The algorithm ...

Text Mining in R Tutorial: Term Frequency & Word Clouds

This tutorial will show you how to analyze text data in R. Visit the site for free downloadable sample data …

Python Find Common Word in Text File

Learn how to Find Common Word in Text File using Python.

Word - Generating a Count of Word Occurrences by Chris Menard

If you need to know every word in a Word document, you can use Allen Wyatt's macro to find out. Also shown in this video are text to tables, repeat header row, ...

How to - Online word counter, Word Analysis, Frequency of Words

Word Analysis: this online word counter helps you stop over-using words in your documents. You can use this online word counter to not just count words but also …

Run a word frequency query

Find the most frequently used words in your research material. Visualize the results as a word cloud.

Weighting by Term Frequency - Intro to Machine Learning

This video is part of an online course, Intro to Machine Learning. Check out the course here: … This course was designed …

TF/IDF

Full course: … We'll introduce the concept of …

Find top 5 words of a document - C++ program

Given a lengthy list of words (from an input file), output the top 5 most frequent words along with their number of occurrences.