AI News, Distributed Machine Learning with Apache Mahout

Distributed Machine Learning with Apache Mahout

This article introduces Mahout, a library for scalable machine learning, and studies potential applications through two Mahout projects.

Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project.  While Mahout has only been around for a few years, it has established itself as a frontrunner in the field of machine learning technologies.

Tame the Machine Learning Beast With Apache Mahout

Mahout is currently (2014/03/21) at version 0.9 and moving quickly toward a fully mature 1.0 release. Some of the newest features include support for Scala, recommenders built on search technologies, and a neural network classifier implementation (a multilayer perceptron, MLP).

It is also important to note that machine learning techniques, depending on the algorithm, won’t be able to answer the question at hand with 100% precision, being more of a “most likely” type of solution instead.

If the two questions don’t point you to Mahout, it may be a good idea to take a look at some of the existing alternatives like Spark MLlib or the Weka library. Another important aspect is the fact that Mahout is written in Java, and consequently it integrates really well with Java applications.

Mahout has been used successfully across a range of businesses and use cases. One last note before diving into the 3 main categories of algorithms: for the vast majority of machine learning problems, the most crucial role in the whole solution is played by the attention paid to the input data.

There are multiple types of recommenders out there (baseline predictor, item-item or user-user collaborative filtering, content-based recommenders, dimensionality reduction, interactive critique-based recommenders) and only a subset is available in Mahout, but for most use cases the user-based recommender will do the job.

Intuition tells us that since user 1 had a preference for book 101 and users 4 and 5 also had a preference for the same book, we might look at other books that users 4 and 5 liked but user 1 never saw before (so maybe one of the 104, 105, 106 books).
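With Mahout’s Taste API, the user-based recommender described above takes only a few lines. A minimal sketch, assuming the preferences live in a hypothetical books.csv file using the userID,itemID,preference layout that FileDataModel expects (class names are from mahout-core 0.9):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BookRecommender {
    public static void main(String[] args) throws Exception {
        // lines of the form "1,101,5.0" (user, book, rating)
        DataModel model = new FileDataModel(new File("books.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // consider the 2 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // top 2 book recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```

With the sample data above, user 1 would be matched against users 4 and 5, and the recommendations would be expected to come from the books they rated but user 1 has not seen.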

As a simple example, if you have a bunch of books and you want to group them together in different piles, you might choose to group them by author (so one pile for each author) or by similar topic (so the computer science books go in one pile, while the biology ones go in another) or even by the color of their cover.

Consider a set of points in a two-dimensional space, where each point is defined by two coordinates x and y: {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8}, {9, 8}, {8, 9}, {9, 9}.

This is a simple and obvious example where the input data is already vectorized for us and the metric stands out as the Euclidean distance, but understanding this sample and seeing it work in Mahout is all that it takes to start working with clustering.
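Mahout runs K-Means as a map-reduce job over vectorized input, but the core iteration is easy to see in plain Java. A self-contained sketch over the nine points above (the initial centroid seeds are chosen by hand for brevity):

```java
import java.util.Arrays;

/** Plain-Java sketch of K-Means over the nine sample points. */
public class KMeansDemo {

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Index of the centroid closest to p. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (distance(p, centroids[i]) < distance(p, centroids[best])) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {
            {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
            {8, 8}, {9, 8}, {8, 9}, {9, 9}
        };
        // k = 2; seed one centroid in each visually obvious group
        double[][] centroids = { {1, 1}, {9, 9} };

        for (int iter = 0; iter < 10; iter++) {
            // assign every point to its nearest centroid, then recompute the means
            double[][] sums = new double[2][2];
            int[] counts = new int[2];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                sums[c][0] += p[0];
                sums[c][1] += p[1];
                counts[c]++;
            }
            for (int c = 0; c < 2; c++) {
                centroids[c][0] = sums[c][0] / counts[c];
                centroids[c][1] = sums[c][1] / counts[c];
            }
        }

        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> cluster " + nearest(p, centroids));
        }
    }
}
```

Running this assigns the first five points to one cluster (centroid near {1.8, 1.8}) and the last four to the other (centroid near {8.5, 8.5}), matching the intuition.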

The results provided by K-Means clustering coincide with our intuition because of the way we represented the input data, the chosen metric (EuclideanDistanceMeasure), and finally the chosen clustering algorithm along with all the configuration parameters.

Choosing the right algorithm and configuration parameters, along with the right distance measure and the way to vectorize the input data, won’t be an easy and obvious task every time, but Mahout helps here as well by providing the means and tools to iterate quickly through this fine-tuning process.

And related to ways of vectorizing the input data, one widespread use for clustering is in the context of text documents, where in order to cluster the data you’ll need to express numerically how important a word is to a certain document.
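The standard weighting for this is TF-IDF (term frequency times inverse document frequency), which is what Mahout’s document vectorizer produces. A plain-Java sketch of the textbook formula (Mahout’s implementation differs in details such as normalization):

```java
/** Textbook TF-IDF weighting for a single term in a single document. */
public class TfIdf {

    /**
     * @param termCountInDoc how many times the term occurs in the document
     * @param docLength      total number of terms in the document
     * @param totalDocs      number of documents in the corpus
     * @param docsWithTerm   number of documents containing the term
     */
    static double tfIdf(int termCountInDoc, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termCountInDoc / docLength;          // frequent in this document
        double idf = Math.log((double) totalDocs / docsWithTerm); // rare across the corpus
        return tf * idf;
    }

    public static void main(String[] args) {
        // a word appearing 3 times in a 100-word document, present in 10 of 1000 documents
        System.out.println(tfIdf(3, 100, 1000, 10));
    }
}
```

A word that appears in every document gets an IDF of log(1) = 0, so ubiquitous words like “the” contribute nothing to the document vectors, which is exactly the behavior clustering needs.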

Classification is the process of using specific information or input to choose a single selection or target from a short list of predetermined potential responses by using a trained model or previous experiences.

In order to have a working classifier, there is usually a three-step cycle that you’ll have to work your way through: train the model; evaluate and fine-tune it; use it in production (gathering feedback and metrics); and repeat as long as necessary to get a good result.

After solving the size problem you will need to ask another set of questions that will guide you toward a good design for your classifier. Through it all, it’s important to remember that building good classification models requires many cycles of iterative refinement.

There is only a strict set of accepted feature types, and failing to identify them correctly will greatly affect the performance of your classifier. In addition to correctly identifying the types of the features in your dataset, you will also need to take care not to fall into the “target leak” pit.

This topic will be our target for the classifier: after training the classification model with sample messages belonging to the twenty categories, we would like to provide an email message as input and get as output the topic it should be assigned to.

The 20news dataset is already sanitized, with the training email messages nicely provided in a separate folder (20news-bydate-train) containing subfolders like talk.politics.mideast, talk.politics.misc or talk.religion.misc, which in turn contain the actual email messages as individual files with random ids as filenames.

First we’ll need some feature encoders, provided by Mahout. Then, for each input file (from the collection of files), we encode it and feed it to the classification model. After all the input files have been processed, we’ll need to close the learning process and write the trained model to a file.
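A minimal sketch of what this training loop can look like with Mahout’s SGD classifier and feature encoders (class names are from mahout-core 0.9; the feature count, the sample token stream, and the tokenization are illustrative assumptions):

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TrainNewsClassifier {
    private static final int FEATURES = 10000;   // size of the hashed feature space
    private static final int CATEGORIES = 20;    // the twenty newsgroups

    public static void main(String[] args) throws Exception {
        StaticWordValueEncoder wordEncoder = new StaticWordValueEncoder("words");
        ConstantValueEncoder interceptEncoder = new ConstantValueEncoder("intercept");
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());

        // for each training message: hash its tokens into a sparse vector...
        Vector features = new RandomAccessSparseVector(FEATURES);
        interceptEncoder.addToVector("", 1, features);
        for (String word : "sample message body".split("\\s+")) {
            wordEncoder.addToVector(word, features);
        }
        // ...and feed it to the model with the index of its newsgroup
        int actualCategory = 0;
        learner.train(actualCategory, features);

        // after all input files: close the learning process and persist the model,
        // e.g. with ModelSerializer from org.apache.mahout.classifier.sgd
        learner.close();
    }
}
```

Classifying a new message then means encoding it the same way and reading off `learner.classifyFull(vector).maxValueIndex()` as the predicted topic.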

In order to use the trained model, you vectorize the input message just as in the training phase and send it as input to the classification method. To get started with Mahout itself, just add the following dependency to your pom (for Maven users) and you are set to use Mahout in stand-alone mode.
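The dependency referred to is presumably mahout-core; for the 0.9 release discussed in this article it would look like:

```xml
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.9</version>
</dependency>
```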

In order to use it on top of Hadoop, you’ll first need to set up a Hadoop cluster (for development and experimenting, a pseudo-distributed single-node machine will do); then you can either write your own map-reduce jobs that use the existing Mahout algorithms or use the distributed implementations provided by Mahout directly.

In the extracted folder you will find the core job jar (e.g. ~/tools/mahout-distribution-0.9/mahout-core-0.9-job.jar). Please note that some of the configuration instructions specific to Hadoop paths and config files might differ depending on the actual version of Hadoop used.
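With the job jar in place, the distributed algorithms are driven through the bin/mahout launcher script. A hypothetical K-Means invocation (the HDFS paths are illustrative; the flags follow the 0.9 kmeans driver):

```shell
# -i vectorized input, -c initial centroids, -o results, -k clusters, -x max iterations
bin/mahout kmeans \
  -i /user/me/input \
  -c /user/me/initial-clusters \
  -o /user/me/output \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -k 2 -x 10 -ow -cl
```

The -cl flag asks the driver to also assign each input point to its final cluster, mirroring the in-memory example above.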

Machine learning library for .net analog of Apache Mahout [closed]

I don't believe I'm familiar with anything similar to Apache Mahout built on top of .NET, but I believe you could use the following approach to get pretty close (how close you can actually get depends on the specifics of what you're trying to do).

Mahout is in fact a collection of standard machine learning algorithms implemented on top of Apache Hadoop to allow them to scale to large data sets, so to get the same effect in a .NET environment you'll need a distributed computation solution (to keep with the spirit of Mahout, I'd use a Map/Reduce implementation) and a machine learning library.

Introduction to Apache Mahout | Edureka

Watch Sample Class recording: ..

Orchestrating the Intelligent Web with Apache Mahout

Presenter(s): Aneesha Bakharia ..

Mahout and Scalable Natural Language Processing

Peter Norvig, the Director of Research at Google, said in the Amazon book review[1] for the book "Statistical Natural Language Processing" "If someone told me I ...

Mahout Item Recommender Tutorial using Java and Eclipse

A basic tutorial on developing your first recommender using the Apache Mahout library. Source code is available at: ...

Big Data Meetup Paso 02 : Mahout y Machine Learning

Mahout y Hadoop 101 (sesión 2)

What is Mahout ? | Edureka

Watch Sample Class Recording: Mahout is a ..

Alfresco Summit 2014: 5-star Ratings & Recommendations with Mahout

Robin Bramley, Chief Scientific Officer at Ixxus Discovery is a key challenge within Information Management. While many approaches focus on metadata or full ...

Recommendation Engines Using ALS in PySpark (MovieLens Dataset)

This tutorial provides an overview of how the Alternating Least Squares (ALS) algorithm works, and, using the MovieLens data set, it provides a code-level ...

Java Data Science Solutions - Big Data & Visualization: Train an Online Logistic Regression Model Using Apache Mahout

This playlist/video has been uploaded for Marketing purposes and contains only selective videos. For the entire video course and code, visit ...

Feature Hashing for Scalable Machine Learning - Nick Pentreath

"Feature hashing is a powerful technique for handling high-dimensional features in machine learning. It is fast, simple, memory-efficient, and well suited to online ...