
Open Machine Learning Course. Topic 8. Vowpal Wabbit: Fast Learning with Gigabytes of Data

This week, we'll cover the two reasons for Vowpal Wabbit's exceptional training speed, namely online learning and the hashing trick, in both theory and practice.

Gradient descent is named after the following fact from calculus: the vector of partial derivatives of a function f(x) = f(x_1, …, x_n) points in the direction of the fastest growth of that function.

The task is the following: find weights w_0 and w_1 such that predicting height as y_i = w_0 + w_1 x_i (where y_i is the i-th height value and x_i is the i-th weight value) minimizes the squared error (equivalently, the mean squared error, since the 1/ℓ factor does not change the minimizer):

SE(w_0, w_1) = \frac{1}{2}\sum_{i=1}^{\ell} \left(y_i - (w_0 + w_1 x_i)\right)^2 \to \min_{w_0, w_1}

We will use gradient descent, utilizing the partial derivatives of SE(w_0, w_1) with respect to the weights w_0 and w_1.

An iterative training procedure is then defined by simple update formulas (we change the model weights in small steps, proportional to a small constant η, in the direction of the antigradient of SE(w_0, w_1)):

w_0^{(t+1)} = w_0^{(t)} - \eta \frac{\partial SE}{\partial w_0}\Big|_{w_0^{(t)}, w_1^{(t)}}, \quad
w_1^{(t+1)} = w_1^{(t)} - \eta \frac{\partial SE}{\partial w_1}\Big|_{w_0^{(t)}, w_1^{(t)}}

Computing the partial derivatives, we get the following:

w_0^{(t+1)} = w_0^{(t)} + \eta \sum_{i=1}^{\ell} \left(y_i - w_0^{(t)} - w_1^{(t)} x_i\right), \quad
w_1^{(t+1)} = w_1^{(t)} + \eta \sum_{i=1}^{\ell} \left(y_i - w_0^{(t)} - w_1^{(t)} x_i\right) x_i

This math works quite well as long as the amount of data is not large (we will not discuss issues with local minima, saddle points, choosing the learning rate, momentum and other topics; they are covered very thoroughly in the Numerical Computation chapter of "Deep Learning").
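To make the update rule concrete, here is a minimal sketch of batch gradient descent for this paired regression on made-up data (the data values, learning rate, and number of iterations are illustrative):

```python
import numpy as np

# Toy data (illustrative, not the article's weight/height sample); roughly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

w0, w1 = 0.0, 0.0   # initial weights
eta = 0.01          # learning rate (the small constant eta)

for _ in range(5000):
    residual = y - (w0 + w1 * x)        # y_i - w0 - w1 * x_i for every object
    w0 += eta * residual.sum()          # step against dSE/dw0
    w1 += eta * (residual * x).sum()    # step against dSE/dw1

print(w0, w1)   # the fitted line is roughly y = 1 + 2x
```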

Stochastic gradient descent gives us practical guidance for training both classifiers and regressors with large amounts of data up to hundreds of GBs (depending on computational resources).

Considering the case of paired regression, we can store the training data set (X, y) on disk without loading it into RAM (where it simply won't fit), read objects one by one, and update the weights of our model after each object:

w_0^{(t+1)} = w_0^{(t)} + \eta \left(y_i - w_0^{(t)} - w_1^{(t)} x_i\right), \quad
w_1^{(t+1)} = w_1^{(t)} + \eta \left(y_i - w_0^{(t)} - w_1^{(t)} x_i\right) x_i

where i indexes the object read at step t. After working through the whole training dataset, our loss function (for example, quadratic loss in regression or logistic loss in classification) will decrease, but it usually takes dozens of passes over the training set to make the loss small enough.
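A minimal sketch of such a streaming pass, assuming a hypothetical file data.csv with one "x,y" pair per line that is read object by object instead of being loaded into memory:

```python
import csv

w0, w1 = 0.0, 0.0
eta = 0.001   # learning rate

# One pass over the data; in practice several passes are usually needed.
with open("data.csv") as f:               # hypothetical file with "x,y" rows
    for row in csv.reader(f):
        x_i, y_i = float(row[0]), float(row[1])
        residual = y_i - (w0 + w1 * x_i)
        w0 += eta * residual               # update after every single object
        w1 += eta * residual * x_i
```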

Now, we will introduce the Vowpal Wabbit library, which is good for training simple models with huge data sets thanks to stochastic optimization and another trick, feature hashing.

The fit method of sklearn's LabelEncoder finds all unique values and builds the mapping between categories and numbers, and the transform method converts the categories into those numbers.
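For instance, with a hypothetical job feature (the values below are made up):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values of a categorical "job" feature
jobs = ["engineer", "teacher", "engineer", "doctor"]

label_encoder = LabelEncoder()
label_encoder.fit(jobs)                  # builds the category -> number mapping
print(list(label_encoder.classes_))      # ['doctor', 'engineer', 'teacher']
print(label_encoder.transform(jobs))     # [1 2 1 0]
```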

For example, we have implicitly introduced an algebra over the values of the job feature: we can now subtract the job of client #2 from the job of client #1. Does this operation make any sense?

When transformed with One-Hot Encoding, this data can be used with linear models. However, real data can be volatile, meaning we cannot guarantee that new values of categorical features will not occur.
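A sketch of the same hypothetical job column encoded with sklearn's OneHotEncoder:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The same hypothetical "job" column, as a 2-D array of shape (n_clients, 1)
jobs = np.array([["engineer"], ["teacher"], ["engineer"], ["doctor"]])

encoded = OneHotEncoder().fit_transform(jobs)
print(encoded.toarray())
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]   # one column per category: doctor, engineer, teacher
```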

Besides that, LabelEncoder requires preliminary analysis of the whole dataset and storage of constructed mappings in memory, which makes it difficult to work with large datasets.

Hash functions can help us find codes for different feature values. We will not use negative values or values of high magnitude, so we restrict the range of values of the hash function, for example by taking the hash modulo some power of two. Imagine that our data set contains a single (i.e. not married) student engineer.

His feature vector will be created similarly to the One-Hot-Encoded case, but in a space of fixed size shared by all features. Note that we hash not only feature values but also pairs of feature name + feature value.
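A minimal sketch of this idea, assuming a feature space of 2^10 dimensions and zlib.crc32 as the hash function (both are illustrative choices; Vowpal Wabbit uses its own MurmurHash-based hashing):

```python
import zlib

HASH_SPACE = 2 ** 10   # fixed size of the feature space (illustrative)

def hashing_vectorizer(features):
    """Map {name: value} pairs into a fixed-size vector via the hashing trick."""
    vec = [0] * HASH_SPACE
    for name, value in features.items():
        # hash the pair "name=value", not the value alone
        idx = zlib.crc32(f"{name}={value}".encode()) % HASH_SPACE
        vec[idx] += 1
    return vec

client = {"marital_status": "single", "education": "student", "job": "engineer"}
vec = hashing_vectorizer(client)
print([i for i, v in enumerate(vec) if v])   # indices of the non-zero components
```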

A line in the VW input format consists of a label, an optional importance weight, and one or more namespaces introduced by the | character, each containing a list of features; an illustrative example follows below. Let's check the format by running VW on such a training sample. VW is a wonderful tool for working with text data.
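For illustration, here is a sketch of what such a line can look like and how to feed it to the vw binary (the label, namespace names, and feature words are made up; this assumes Vowpal Wabbit is installed and on the PATH):

```python
import subprocess

# A single training example in VW format:
#   label [importance] |namespace feature feature ... |namespace feature ...
example = "1 1.0 |subject machine learning course |body vowpal wabbit is fast\n"

# vw reads examples from stdin when no data file is given
subprocess.run(["vw"], input=example.encode(), check=True)
```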

The 20 topics are ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']. Let's look at the first document in this collection, and then convert the data into something Vowpal Wabbit can understand.
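A sketch of such a conversion, writing one VW-format line per document for an illustrative binary task (is the document from sci.space or not), with the +1/-1 labels that VW's binary losses expect:

```python
import re
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="train")

def to_vw_format(document, label):
    # keep only word characters and lowercase them;
    # VW treats '|' and ':' specially, so they must not appear in features
    words = re.findall(r"\w+", document.lower())
    return f"{label} |text {' '.join(words)}\n"

# Illustrative binary task: sci.space vs. everything else (+1 / -1 labels)
space_idx = newsgroups.target_names.index("sci.space")
with open("20news_train.vw", "w") as f:
    for text, target in zip(newsgroups.data, newsgroups.target):
        f.write(to_vw_format(text, 1 if target == space_idx else -1))
```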

Now, we apply our trained model to the test set, saving predictions into a file with the -p flag. Then we load the predictions, compute the AUC, and plot the ROC curve. The AUC value we get shows that we have achieved high classification quality.
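A sketch of that evaluation step, assuming the model was trained with logistic loss and the raw scores were written by vw -p to a hypothetical file test_predictions.txt, one score per line, with the true 0/1 labels stored in test_labels.txt:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_test = np.loadtxt("test_labels.txt")        # hypothetical 0/1 labels of the test set

# raw VW scores; a sigmoid turns them into probabilities
scores = np.loadtxt("test_predictions.txt")
probs = 1.0 / (1.0 + np.exp(-scores))

print("AUC:", roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```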

Vowpal Wabbit is a little picky: it wants labels to run from 1 to K, where K is the number of classes in the classification task (20 in our case).
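So the 0-based sklearn targets have to be shifted by one when writing the multiclass training file, and the number of classes is passed to VW via its --oaa (one-against-all) option; a sketch with illustrative file names:

```python
import subprocess

# When writing the multiclass training file, shift sklearn's 0..19 targets to 1..20,
# e.g. reusing the to_vw_format helper from the sketch above:
#   f.write(to_vw_format(text, target + 1))

# Train a 20-class one-against-all model
subprocess.run(
    ["vw", "--oaa", "20",
     "-d", "20news_train_mult.vw",
     "-f", "20news_model_mult.vw"],
    check=True,
)
```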

If you want to reproduce the results, please download the archive, unzip it, and set the path to imdb_reviews (it already contains train and test subdirectories).

The data is quite clean, so don’t call it “Big Data”, even in a pub. :) We chose only 10 tags: ‘javascript’, ‘java’, ‘python’, ‘ruby’, ‘php’, ‘c++’, ‘c#’, ‘go’, ‘scala’ and ‘swift’.

We will process the training set (3.1 GiB) with Vowpal Wabbit, using arguments along the lines of the sketch below. The model trained and made predictions in less than a minute (these results were obtained on a MacBook Pro, mid 2015, 2.2 GHz Intel Core i7, 16 GB RAM).
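A sketch of a plausible invocation for this 10-class task (the file names and the feature-table bit width are assumptions, not the article's exact command):

```python
import subprocess

subprocess.run(
    ["vw", "--oaa", "10",                  # one-against-all over the 10 tags
     "-d", "stackoverflow_train.vw",       # hypothetical training file in VW format
     "-f", "stackoverflow_model.vw",       # where to save the trained model
     "-b", "28",                           # 2^28 slots in the hashed feature table
     "--passes", "1"],                     # a single pass over the 3.1 GiB file
    check=True,
)
```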
