AI News, Abusing hash kernels for wildly unprincipled machine learning
- On Sunday, June 3, 2018
In this post, I’ll go over the trouble with preparing data for machine learning, and then describe hashkernel, a Python module I wrote to demonstrate how a technique called hash kernels can avoid these pitfalls.
While I hope that this post will be accessible to readers without any experience with machine learning, you may find it easier to follow along if you’re already comfortable with concepts such as classification and logistic regression.
Feature vectors are a great way to represent some inputs, like a geometric point in space where each entry in the feature vector is a coordinate, or a fixed-size image where each entry is the brightness of a pixel.
To easily train a classifier on real-world data, we need to be able to convert data in an arbitrary format, consisting of a mix of types, into a feature vector that a classifier can understand.
In this section, I’ll first describe how hash kernels work, and then how they can help us more efficiently convert a piece of input text to a feature vector suitable for machine learning.
In the spam filtering case, where each word is a feature (one if the word is found in an input text, or zero if absent), we hash that word to obtain an index in the array.
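As a rough illustration of the idea above, here is a minimal sketch of hashing words into a fixed-size binary feature vector. The function names, the array size of 16, and the use of MD5 as the hash are assumptions made for this example, not details taken from hashkernel.

```python
import hashlib
from typing import List


def feature_index(feature: str, size: int) -> int:
    """Map a feature name to a slot in a fixed-size array.

    MD5 is used only because it is deterministic across runs; Python's
    built-in hash() is randomized per process for strings.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def hash_features(text: str, size: int = 16) -> List[int]:
    """Build a binary vector: slot is 1 if any word hashing to it is present."""
    vector = [0] * size
    for word in text.lower().split():
        vector[feature_index(word, size)] = 1
    return vector


vec = hash_features("cheap pills buy cheap pills now")
print(vec)
```

Note that no vocabulary has to be built in advance: any word, including one never seen during training, maps to some slot in the array.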
observe that in one of their experiments, a collision rate of 94% (that is, 94% of features mapped to the same index as one or more other features) resulted in an increase in the experimental error rate from 5.5% to only 6%.
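To make the definition of collision rate used above concrete, the following sketch counts the fraction of distinct features that share an index with at least one other feature. The synthetic vocabulary and the array sizes are made up for illustration and have nothing to do with the cited experiment.

```python
import hashlib
from collections import Counter


def feature_index(feature, size):
    """Deterministically map a feature name to a slot (MD5 is an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def collision_rate(features, size):
    """Fraction of distinct features whose slot is shared with another feature."""
    distinct = set(features)
    if not distinct:
        return 0.0
    slot_counts = Counter(feature_index(f, size) for f in distinct)
    collided = sum(1 for f in distinct if slot_counts[feature_index(f, size)] > 1)
    return collided / len(distinct)


# Hypothetical vocabulary of 1000 features, hashed into arrays of varying size.
vocab = ["word%d" % i for i in range(1000)]
for size in (100, 1000, 100000):
    print(size, round(collision_rate(vocab, size), 3))
```

Shrinking the array relative to the vocabulary drives the collision rate up, which is the trade-off between memory and accuracy that the quoted experiment measures.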
Now that we’ve seen how hash kernels can power text classification tasks, let’s take a look at how we can apply hash kernels to just about any type of structured data.
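One plausible way to do this, sketched below, is to flatten a nested record into "path=value" strings and hash each string exactly as a word would be hashed. The flattening scheme, the helper names, and the sample record are all assumptions for illustration; hashkernel's own encoding may differ.

```python
import hashlib


def feature_index(feature, size):
    """Deterministically map a feature string to a slot (MD5 is an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def flatten(value, path=()):
    """Yield one 'path=value' string per leaf of nested dicts and lists."""
    if isinstance(value, dict):
        for key in sorted(value, key=str):
            yield from flatten(value[key], path + (str(key),))
    elif isinstance(value, (list, tuple, set)):
        for item in value:
            yield from flatten(item, path)
    else:
        yield ".".join(path) + "=" + str(value)


def vectorize(record, size=2 ** 10):
    """Return the set of active indices for a record (a sparse binary vector)."""
    return {feature_index(f, size) for f in flatten(record)}


# Hypothetical record mixing numbers, strings, nesting, and lists.
record = {"age": 39, "job": {"title": "clerk"}, "tags": ["a", "b"]}
features = list(flatten(record))
print(features)
print(sorted(vectorize(record)))
```

This sketch treats every leaf, including numbers, as a categorical token, so 39 and 40 become unrelated features. That is crude, but it requires no schema and no per-field preprocessing.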
With virtually no set-up, I was able to train a classifier on each dataset and achieve respectable results, in line with those reported by researchers using other, more labor-intensive machine learning techniques.
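For readers who want to see how little machinery such a classifier needs, here is a minimal sketch of logistic regression trained by stochastic gradient descent over hashed binary features. It is not hashkernel's implementation; the hyperparameters, function names, and toy examples are assumptions chosen for illustration.

```python
import hashlib
import math


def feature_index(feature, size):
    """Deterministically map a feature string to a slot (MD5 is an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def predict_proba(weights, features):
    """Probability of the positive class for a collection of feature strings."""
    indices = {feature_index(f, len(weights)) for f in features}
    z = sum(weights[i] for i in indices)
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, size=2 ** 20, epochs=20, learning_rate=0.5):
    """Fit weights with per-example gradient steps on the log loss."""
    weights = [0.0] * size
    for _ in range(epochs):
        for features, label in examples:  # label is 0 or 1
            error = predict_proba(weights, features) - label
            for i in {feature_index(f, size) for f in features}:
                weights[i] -= learning_rate * error
    return weights


# Toy data: 1 = spam, 0 = not spam.
examples = [
    (["buy", "pills"], 1),
    (["cheap", "pills"], 1),
    (["meeting", "tomorrow"], 0),
    (["lunch", "tomorrow"], 0),
]
weights = train(examples)
print(predict_proba(weights, ["pills"]), predict_proba(weights, ["tomorrow"]))
```

The weight array has a fixed size chosen up front, so memory use does not grow with the number of distinct features encountered.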
You can replicate these results by running hashkernel.py. For comparison, the contributors of the US Adult dataset report error rates no better than fourteen percent using a number of different machine learning techniques.
Using these stripped dictionaries as instances, I was able to train a hashkernel classifier with little effort. To evaluate this classifier, I ran leave-one-out cross-validation over the set of 346 Facebook data objects.
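Leave-one-out cross-validation trains on every instance except one, tests on the held-out instance, and repeats this for each instance in turn. The sketch below shows the procedure with a stand-in majority-class classifier; the `train` and `predict` callables and the toy data are hypothetical and are not hashkernel's interface.

```python
from collections import Counter


def leave_one_out_error(instances, labels, train, predict):
    """Return the fraction of held-out instances that are misclassified."""
    errors = 0
    for i in range(len(instances)):
        # Train on everything except instance i.
        train_x = instances[:i] + instances[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, instances[i]) != labels[i]:
            errors += 1
    return errors / len(instances)


def train_majority(xs, ys):
    """Stand-in 'classifier' that just remembers the most common label."""
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    """Always predict the remembered majority label."""
    return model


instances = [{"id": i} for i in range(4)]
labels = ["a", "a", "a", "b"]
err = leave_one_out_error(instances, labels, train_majority, predict_majority)
print(err)
```

With 346 instances this procedure trains 346 separate models, which is affordable only because each one is cheap to fit.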
To replicate my results, run … The resulting output indicates an error rate of twenty-six percent, meaning that hashkernel was correct about 74% of the time.
The academic literature contains a wealth of information on the (principled) use of hash kernels. Some of this academic work has been implemented in the featureful Vowpal Wabbit package, which comes out of the research arms of Yahoo and Microsoft.
It further exploits hash kernels by using the hash to map words (or other types of features) to different machines in a cluster in order to parallelize the machine learning process.
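The general idea of using a hash to assign features to machines can be sketched as follows. This is not Vowpal Wabbit's code or API; the function names, the choice of MD5, and the cluster size of 4 are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict


def feature_hash(feature):
    """Deterministic integer hash of a feature string (MD5 is an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def machine_for(feature, n_machines):
    """Pick the machine responsible for this feature's weight."""
    return feature_hash(feature) % n_machines


def partition(features, n_machines):
    """Group features by the machine that owns them."""
    shards = defaultdict(list)
    for feature in features:
        shards[machine_for(feature, n_machines)].append(feature)
    return dict(shards)


words = ["buy", "cheap", "pills", "meeting", "lunch", "tomorrow"]
shards = partition(words, 4)
print(shards)
```

Because the assignment depends only on the feature itself, every machine can compute where a feature lives without consulting a shared lookup table.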