AI News, Conjecture: Scalable Machine Learning in Hadoop with Scalding
- On Thursday, October 4, 2018
- By Read More
Conjecture: Scalable Machine Learning in Hadoop with Scalding
For instance, we use predictive machine learning models to estimate click rates of items so that we can present high quality and relevant items to potential buyers on the site.
In addition to contributing to on-site experiences, we use machine learning as a component of many internal tools, such as routing and prioritizing our internal support e-mail queue.
for inbound support e-mails, we can assign support requests to the appropriate personnel and ensure that urgent requests are handled by staff more rapidly, helping to ensure a good customer experience.
To quickly develop these types of predictive models while making use of our MapReduce cluster, we decided to construct our own machine learning framework, open source under the name “Conjecture.”
It consists of three main parts: This article is intended to give a brief introduction to predictive modeling, an overview of Conjecture’s capabilities, as well as a preview of what we are currently developing.
For the classification of e-mails as urgent or not, we took the historic e-mails as the inputs, and an indicator of whether the support staff had marked the email as urgent or not after reading its contents.
The choice to store the names of features along with their numeric value allows us to easily inspect for any weirdness in our input, and quickly iterate on models by finding the causes of problematic predictions.
Conjecture, like many current libraries for performing scalable machine learning, leverages a family of techniques known as online learning. In online learning, the model processes the labeled training examples one at a time, making an update to the underlaying prediction function after each observation.
While the online learning paradigm isn’t compatible with every machine learning technique, it is a natural fit for several important classes of machine learning models such as logistic regression and large margin classification, both of which are implemented in Conjecture.
However, traditionally the online learning framework does not directly lend itself to a parallel method for model training. Recent research into parallelized online learning in Hadoop gives us a way to perform sequential updates of many models in parallel across many machines, each separate process consuming a fraction of the total available training data.
of the process for the purpose of further training. Theoretical results tell us, that when performed correctly, this process will result in a reliable predictive model, similar to what would be generated had there been no parallelization.
The key to our distributed online machine learning algorithm is in the definition of appropriate reduce operations so that map-side aggregators will implement as much of the learning as possible.
When the reduce operation is consuming two models which both have some training, they are merged (e.g., by summing or averaging the parameter values) otherwise, we continue to train whichever model has some training already.
This leads to more flexibility in the types of models we can train, and also lends better statistical properties to the resulting models (for example by reducing variance in the gradient estimates in the case of logistic regression).
Since this procedure yields one set of performance metrics for each fold, we take the appropriately weighted means of these, and the observed variance also gives an indication of the reliability of the estimate.
On the other hand if there is a great discrepancy between the performance on different folds then it suggests that the mean will be an unreliable estimate of future performance, and possibly that either more data is needed or the model needs some more feature engineering.
To accomplish this, we deploy our model file (a single JSON string encoding the internal model data structures) to the servers where we instantiate it using PHP’s json_decode() function.
An example of a json-encoded conjecture model is below: Currently the JSON-serialized models are stored in a git repository and deployed to web servers via Deployinator (see the Code as Craft post about it here).
Deployinator broadcasts models to all the servers running code that could reference Conjecture models, including the web hosts, and our cluster of gearman workers that perform asynchronous tasks, as well as utility boxes which are used to run cron jobs and ad hoc jobs.
- On Saturday, September 21, 2019
The 7 Steps of Machine Learning (AI Adventures)
How can we tell if a drink is beer or wine? Machine learning, of course! In this episode of Cloud AI Adventures, Yufeng walks through the 7 steps involved in ...
11. Introduction to Machine Learning
MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016 View the complete course: Instructor: Eric Grimson ..
How computers learn to recognize objects instantly | Joseph Redmon
Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision ...
Hello World - Machine Learning Recipes #1
Six lines of Python is all it takes to write your first machine learning program! In this episode, we'll briefly introduce what machine learning is and why it's ...
Build a TensorFlow Image Classifier in 5 Min
In this episode we're going to train our own image classifier to detect Darth Vader images. The code for this repository is here: ...
Jet Engine, How it works ?
Help us to make future videos for you. Make LE's efforts sustainable. Please support us at Patreon ! The working of a ..
Natural Language Processing: Crash Course Computer Science #36
Today we're going to talk about how computers understand speech and speak themselves. As computers play an increasing role in our daily lives there has ...
5. Stochastic Processes I
MIT 18.S096 Topics in Mathematics with Applications in Finance, Fall 2013 View the complete course: Instructor: Choongbum ..
Homeostasis and Negative/Positive Feedback
Explore homeostasis with the Amoeba Sisters and learn how homeostasis relates to feedback in the human body. This video gives examples of negative ...
Tig Welding Stainless Steel - Walking the Cup vs TIG Finger
A popular tig welding technique for stainless steel is walking the cup. But it does not always work best. And for some ..