AI News, A Scikit-learn pipeline in Wallaroo
- On Wednesday, August 1, 2018
A Scikit-learn pipeline in Wallaroo
While it would seem that machine learning is taking over the world, most of the attention has focused on researching new methods and applications, and on making a single model faster.
At Wallaroo Labs we believe that, to make the benefits of machine learning ubiquitous, there needs to be a significant improvement in how we put those impressive models into production.
This is where the stream computing paradigm becomes useful: as with any other type of computation, we can use streaming to apply machine learning models to a large quantity of incoming data, using established distributed computing techniques.
In this example, we will explore how we can build a machine learning pipeline inside Wallaroo, our high-performance stream processing engine, to classify images from the MNIST dataset, using a basic two-stage model in Python.
While recognizing hand-written digits is a practically solved problem, even a simple example like the one we are presenting provides a real use case (imagine automated cheque reading in a large bank), and the same setup can be used as a starting point for virtually any machine learning application - just replace the model.
The MNIST dataset contains 70,000 grayscale images (60,000 for training and 10,000 for testing), each 28 x 28 pixels, of hand-written digits from 0 to 9.
While training is a fundamental part of the machine learning process, stream computing lends itself much better to situations where the model is used for inference, perhaps as part of a larger pipeline that may include data pre-processing and result interpretation.
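The idea of an inference pipeline with pre-processing and result interpretation can be pictured as a chain of stages applied to every incoming record. The following is a minimal sketch in plain Python, not Wallaroo's API: the stage names (`preprocess`, `classify`, `interpret`) and the `run_pipeline` helper are hypothetical, and the "model" is a trivial brightness threshold standing in for a trained classifier.

```python
from typing import Callable, Iterable, Iterator, List

Image = List[float]


def preprocess(image: Image) -> Image:
    """Scale raw 0-255 pixel values into the 0-1 range."""
    return [pixel / 255.0 for pixel in image]


def classify(features: Image) -> int:
    """Stand-in model (an assumption): 1 if mean intensity exceeds 0.5, else 0."""
    return 1 if sum(features) / len(features) > 0.5 else 0


def interpret(label: int) -> str:
    """Turn the numeric label into a human-readable result."""
    return "bright" if label == 1 else "dark"


def run_pipeline(stream: Iterable[Image], stages: List[Callable]) -> Iterator:
    """Lazily push each record of the stream through every stage in order."""
    for record in stream:
        for stage in stages:
            record = stage(record)
        yield record


# Two tiny fake "images" standing in for an unbounded stream of inputs.
incoming = [[0, 10, 20, 30], [250, 240, 255, 200]]
results = list(run_pipeline(incoming, [preprocess, classify, interpret]))
```

Because `run_pipeline` is a generator, records are processed one at a time as they arrive, which is the property that makes the streaming model a natural fit for inference.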
We invite you to take a detailed look at the training code, but for the sake of this blog entry, we only need to know that it trains a PCA for data preprocessing and a logistic regression for classification.
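For readers who want a feel for what such a two-stage model looks like, here is a hedged sketch using scikit-learn. It is not the original training script: to stay self-contained it uses the small 8 x 8 digits dataset bundled with scikit-learn instead of MNIST, and the number of PCA components, the train/test split, and the `classify` helper are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Bundled 8x8 digits dataset, used here as a small stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stage 1: PCA reduces each flattened image to a few components.
# n_components=20 is an arbitrary choice for this sketch.
pca = PCA(n_components=20, random_state=0).fit(X_train)

# Stage 2: logistic regression classifies the reduced vectors.
clf = LogisticRegression(max_iter=2000).fit(pca.transform(X_train), y_train)


def classify(image):
    """Run both stages on one flattened image and return the predicted digit."""
    reduced = pca.transform(image.reshape(1, -1))
    return int(clf.predict(reduced)[0])


# Accuracy on the held-out images, to check that the two stages work together.
accuracy = clf.score(pca.transform(X_test), y_test)
```

Keeping the two fitted objects separate, rather than wrapping them in a single `Pipeline`, mirrors the idea of two distinct stages that a stream processor could run as independent steps.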
To run our application, we need to follow a few steps. Together, they send the entire MNIST dataset to the Wallaroo application, which in turn sends the encoded output classifications to the nc (netcat) program.
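The post does not spell out how the output classifications are encoded. As an illustration only, the sketch below assumes a simple framing in which each message is a 4-byte big-endian length header followed by the digit as UTF-8 text. The `encode_classification` and `decode_frames` helpers are hypothetical names, and the real application may use a different wire format.

```python
import struct


def encode_classification(label: int) -> bytes:
    """Frame one predicted digit as <4-byte big-endian length><UTF-8 payload>."""
    payload = str(label).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def decode_frames(data: bytes):
    """Yield the digits contained in a buffer of concatenated frames."""
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield int(data[offset:offset + length].decode("utf-8"))
        offset += length


# Round trip: what a sink would write, and what a reader on the other end could recover.
wire = b"".join(encode_classification(digit) for digit in [7, 2, 1])
decoded = list(decode_frames(wire))
```

A length header lets a receiver split a continuous byte stream back into individual messages without relying on delimiters.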
A lot of extra functionality can be added to production-level code, but for the purpose of illustrating how to run scikit-learn algorithms in Wallaroo, we preferred to narrow the focus and reduce distractions.
Our VP of Engineering walks you through the concepts that were covered in this blog post using our Python API and then shows the word count application scaling by adding new workers to the cluster.