
Abusing hash kernels for wildly unprincipled machine learning

In this post, I’ll go over the trouble with preparing data for machine learning, and then describe hashkernel, a Python module I wrote to demonstrate how a technique called hash kernels can avoid these pitfalls.

While I hope that this post will be accessible to readers without any experience with machine learning, you may find it easier to follow along if you’re already comfortable with concepts such as classification and logistic regression.

Feature vectors are a great way to represent some inputs, like a geometric point in space where each entry in the feature vector is a coordinate, or a fixed-size image where each entry is the brightness of a pixel.

To easily train a classifier on real-world data, we need to be able to convert data in an arbitrary format, consisting of a mix of types, into a feature vector that a classifier can understand.

In this section, I’ll first describe how hash kernels work, and then how they can help us more efficiently convert a piece of input text to a feature vector suitable for machine learning.

In the spam filtering case, where each word is a feature (one if the word is found in an input text, or zero if absent), we hash that word to obtain an index in the array.
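As a rough sketch of the idea (not hashkernel's actual implementation; the bucket count and the choice of MD5 are assumptions), hashing each word into a fixed-size binary feature vector might look like this:

```python
import hashlib

NUM_BUCKETS = 2 ** 18  # assumed fixed length of the feature vector


def bucket(word):
    """Hash a word to an index in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def featurize(text):
    """Binary bag of words: 1 if a word appears in the text, 0 otherwise."""
    vector = [0] * NUM_BUCKETS
    for word in text.lower().split():
        vector[bucket(word)] = 1
    return vector


# Any input text maps to a vector of the same length, and no dictionary
# of known words is ever built or stored.
features = featurize("Click here to claim your free prize")
```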

The researchers observe that in one of their experiments, a collision rate of 94% (that is, 94% of features mapped to the same index as one or more other features) increased the experimental error rate only from 5.5% to 6%.

Now that we’ve seen how hash kernels can power text classification tasks, let’s take a look at how we can apply hash kernels to just about any type of structured data.
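For example, a record with mixed field types can be flattened into features by hashing each field name (and, for categorical fields, its value). This is only a sketch of the general idea, not hashkernel's actual API:

```python
import hashlib

NUM_BUCKETS = 2 ** 18


def index_of(feature):
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def featurize_record(record):
    """Hash each field of a dict into a slot of a fixed-size feature vector."""
    vector = [0.0] * NUM_BUCKETS
    for key, value in record.items():
        if isinstance(value, (int, float)):
            # Numeric field: hash the field name, use the value as the weight.
            vector[index_of(key)] += float(value)
        else:
            # Categorical field: hash the "name=value" string as one feature.
            vector[index_of("{}={}".format(key, value))] += 1.0
    return vector


# A hypothetical record with mixed types, in the spirit of the US Adult data.
features = featurize_record({"age": 39, "education": "Bachelors", "sex": "Male"})
```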

With virtually no set-up, I was able to train a classifier on each dataset and achieve respectable results, in line with those reported by researchers using other, more labor-intensive machine learning techniques.

You can replicate these results by running hashkernel.py. For comparison, the contributors of the US Adult dataset report error rates no better than 14% using a number of different machine learning techniques.

Using these stripped dictionaries as instances, I was able to train a hashkernel classifier with little effort. To evaluate this classifier, I ran leave-one-out cross-validation over the set of 346 Facebook data objects.
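Leave-one-out cross-validation simply holds out each instance in turn, trains on the remainder, and tests on the held-out instance. A generic sketch follows; the train and predict callables are placeholders, not hashkernel's actual interface:

```python
def leave_one_out_error(instances, labels, train, predict):
    """Fraction of instances misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(instances)):
        held_out = instances[i]
        rest_x = instances[:i] + instances[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        if predict(model, held_out) != labels[i]:
            errors += 1
    return errors / len(instances)
```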

To replicate my results, run hashkernel.py. The reported error rate of 26% means that hashkernel was correct about 74% of the time.

The academic literature contains a wealth of information on the (principled) use of hash kernels. Some of this academic work has been implemented in the featureful Vowpal Wabbit package out of Yahoo!’s and Microsoft’s research arms.

It further exploits hash kernels by using the hash to map words (or other types of features) to different machines in a cluster in order to parallelize the machine learning process.
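As a purely conceptual illustration (not Vowpal Wabbit's actual protocol), hashing can route each feature to the machine responsible for its shard of the parameters:

```python
import hashlib


def machine_for(feature, n_machines=4):
    """Assign a feature to one of n_machines by hashing its name, so each
    machine owns a disjoint shard of the weight vector (conceptual sketch)."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_machines


# Features from a single document can then be processed by different machines.
assignments = {word: machine_for(word) for word in "free prize click here".split()}
```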

Feature hashing

In machine learning, feature hashing, also known as the hashing trick,[1] is a fast and space-efficient way of vectorizing features, that is, turning arbitrary features into indices in a vector or matrix.

It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

In a typical document classification task, a bag of words (BOW) representation is constructed from the input text: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature (independent variable) of each of the documents in both the training and test sets.

Therefore, the bags of words for a set of documents can be regarded as a term-document matrix where each row is a single document and each column is a single feature/word.

The common approach is to construct, at learning time or prior to that, a dictionary representation of the vocabulary of the training set, and use that to map words to indices.
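For contrast with the hashing approach described below, a minimal sketch of this dictionary-based approach (illustrative only) might look like:

```python
def build_vocabulary(training_docs):
    """Assign each distinct token in the training set its own column index."""
    vocab = {}
    for doc in training_docs:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def vectorize(doc, vocab):
    """Count occurrences of known tokens; tokens outside the vocabulary are dropped."""
    counts = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            counts[vocab[token]] += 1
    return counts


docs = ["the cat sat on the mat", "the dog barked"]
vocab = build_vocabulary(docs)                      # must be stored alongside the model
term_document_matrix = [vectorize(d, vocab) for d in docs]
```

Note that the vocabulary grows with the training set and must be kept around at prediction time, which is exactly the overhead the hashing trick avoids.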

Conversely, if the vocabulary is kept fixed and not increased with a growing training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary so as to circumvent a machine-learned filter.

Note that the hashing trick isn't limited to text classification and similar tasks at the document level, but can be applied to any problem that involves large (perhaps unbounded) numbers of features.

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words), then using the hash values directly as feature indices and updating the resulting vector at those indices.

It has been suggested that a second, single-bit output hash function ξ be used to determine the sign of the update value, to counter the effect of hash collisions.[1]

When a second hash function ξ is used to determine the sign of a feature's value, the expected mean of each column in the output array becomes zero because ξ causes some collisions to cancel out.[1]
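A minimal sketch of the hashing trick with this signed variant, assuming MD5-derived bits for both the index hash h and the single-bit sign hash ξ:

```python
import hashlib


def hashed_features(tokens, n_features=2 ** 20):
    """Feature hashing with a second, single-bit hash choosing the sign of
    each update, so colliding features tend to cancel rather than pile up."""
    x = [0.0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # index hash h
        xi = 1.0 if digest[8] & 1 else -1.0     # single-bit sign hash (xi)
        x[h % n_features] += xi
    return x


x = hashed_features("buy cheap meds now buy now".split())
```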

Ganchev and Dredze showed that in text classification applications with random hash functions and several tens of thousands of columns in the output vectors, feature hashing need not have an adverse effect on classification performance, even without the signed hash function.[2]

Weinberger et al. applied their variant of hashing to the problem of spam filtering, formulating it as a multi-task learning problem in which the input features are pairs (user, feature), so that a single parameter vector captured per-user spam filters as well as a global filter for several hundred thousand users; they found that the accuracy of the filter went up.[1]
