
Top 15 Scala Libraries for Data Science in 2018

Scala has gained popularity mostly due to the rise of Spark, the big data processing engine of choice, which is written in Scala and thus provides a native Scala API.

Currently, Python and R remain the leading languages for rapid data analysis and for building, exploring, and manipulating powerful models, while Scala is becoming the key language for developing functional products that work with big data, since such systems require stability, flexibility, high speed, and scalability.

For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.

In fact, there is just one top-level comprehensive tool that forms the basis for developing data science and big data solutions in Scala: Apache Spark. It is supplemented by a wide range of libraries and instruments written in both Scala and Java.
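
To give a sense of how Spark is used from Scala, here is a minimal sketch that starts a local session and inspects a CSV file; the file path and the application name are placeholders, not something taken from the original article.

import org.apache.spark.sql.SparkSession

object SparkQuickLook extends App {
  // Start a local Spark session using all cores of the local machine.
  val spark = SparkSession.builder()
    .appName("quick-look")
    .master("local[*]")
    .getOrCreate()

  // Read a CSV file with a header row and take a first look at it.
  val df = spark.read.option("header", "true").csv("data/measurements.csv")
  df.printSchema()
  df.show(5)

  spark.stop()
}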

Breeze provides fast and efficient manipulation of data arrays and implements many other numerical operations, including linear algebra routines, probability distributions, and optimization. Breeze also offers plotting capabilities, which we will discuss below.
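
As a minimal illustration of the kind of array manipulation Breeze supports, the following sketch builds a vector and a matrix and applies a few common operations; the values are arbitrary.

import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.stats.mean

object BreezeSketch extends App {
  val v = DenseVector(1.0, 2.0, 3.0)
  val m = DenseMatrix((1.0, 0.0), (0.0, 2.0), (1.0, 1.0)) // 3 x 2 matrix

  println(v * 2.0)  // element-wise scaling
  println(sum(v))   // 6.0
  println(mean(v))  // 2.0
  println(m.t * v)  // matrix-vector product: 2 x 3 matrix times length-3 vector
}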

Conveniently, ScalaLab has access to a variety of scientific Java and Scala libraries, so you can easily import your data and then use different methods for manipulation and computation.

These libraries are mostly used as text parsers, with Puck being more convenient if you need to parse thousands of sentences, due to its high speed and use of the GPU.

Vegas provides declarative visualization that allows you to focus mainly on specifying what needs to be done with the data and conducting further analysis of the visualizations, without having to worry about the code implementation.
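
As a rough sketch of that declarative style, the snippet below describes a bar chart of populations by country rather than drawing it step by step; it is adapted from the pattern shown in the Vegas documentation, and the data values are made up.

import vegas._

object VegasSketch extends App {
  Vegas("Country Pop").
    withData(Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "Denmark", "population" -> 80)
    )).
    encodeX("country", Nom).       // nominal axis
    encodeY("population", Quant).  // quantitative axis
    mark(Bar).
    show
}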

The library will impress you with its speed and breadth of application, efficient memory usage, and a large set of machine learning algorithms for classification, regression, nearest-neighbor search, feature selection, and more.

It utilizes mathematical formulas to create complex dynamic neural networks through a combination of object-oriented and functional programming.

Summingbird is a domain-specific data processing framework which allows integration of batch and online MapReduce computations as well as the hybrid batch/online processing mode.

The main catalyst for designing the framework came from Twitter developers, who often found themselves writing the same code twice: first for batch processing and then once more for online processing.

Summingbird consumes and generates two types of data: streams (infinite sequences of tuples), and snapshots regarded as the complete state of a dataset at some point in time.
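
In the spirit of Summingbird's canonical word-count example, the sketch below writes a single piece of logic against the abstract Platform type, so the same code can later be bound to a batch backend (such as Scalding) or a streaming backend (such as Storm); the tokenization here is a simplified stand-in.

import com.twitter.summingbird._

object WordCountJob {
  // Platform-agnostic word count: flatMap sentences into (word, 1L) pairs
  // and let the chosen platform aggregate the counts into the store.
  def wordCount[P <: Platform[P]](
      source: Producer[P, String],
      store: P#Store[String, Long]) =
    source
      .flatMap(sentence => sentence.toLowerCase.split("\\s+").toSeq.map(_ -> 1L))
      .sumByKey(store)
}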

It enables you to easily and efficiently build, evaluate and deploy engines, implement your own machine learning models, and incorporate them into your engine.

The main difference, which is also considered the most significant improvement, is the additional layer between the actors and the underlying system: the actors only need to process messages, while the framework handles all other complications.

All actors are hierarchically arranged, thus creating an Actor System which helps actors to interact with each other more efficiently and solve complex problems by dividing them into smaller tasks.
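
Assuming the framework described here is Akka, whose hierarchical ActorSystem matches this description, a minimal sketch with classic actors looks like the following; the actor names and the message are purely illustrative.

import akka.actor.{Actor, ActorSystem, Props}

// A parent actor delegates work to a child, illustrating the hierarchy
// described above: the supervisor splits work off to a worker.
class Worker extends Actor {
  def receive: Receive = {
    case task: String => println(s"${self.path.name} processing: $task")
  }
}

class Supervisor extends Actor {
  private val worker = context.actorOf(Props[Worker], "worker-1")
  def receive: Receive = {
    case task: String => worker ! task
  }
}

object AkkaSketch extends App {
  val system = ActorSystem("pipeline")
  val supervisor = system.actorOf(Props[Supervisor], "supervisor")
  supervisor ! "parse-batch-42"  // fire-and-forget message
  Thread.sleep(500)              // give the asynchronous message time to arrive
  system.terminate()
}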

It assures asynchronous, non-blocking, actor-based, high-performance request processing, while the internal Scala DSL provides a way to define web service behavior, along with efficient and convenient testing capabilities.
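
The framework is not named in this passage, but the description matches Spray; assuming that is the library in question, a minimal service built with its routing DSL might look roughly like the sketch below (based on Spray 1.x's SimpleRoutingApp; a newer stack would use the equivalent Akka HTTP routes instead).

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object HelloService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("hello-service")

  // Declare the web service behavior with the routing DSL:
  // GET /hello responds with a plain-text greeting.
  startServer(interface = "localhost", port = 8080) {
    path("hello") {
      get {
        complete("Hello from the routing DSL")
      }
    }
  }
}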

If you have some positive experience with any other useful Scala libraries or frameworks that are worth adding to this list, please feel free to share them in the comment section below.

Comparison of top data science libraries for Python, R and Scala [Infographic]

Each of these languages is suited to specific types of tasks; besides, each developer chooses the tool that is most convenient for them.

Primarily designed for statistical computing, R offers an excellent set of high-quality packages for statistical data collection and visualization.

Keep in mind that the choice of programming language and libraries depends on the specific task, so it is beneficial to know the strengths and weaknesses of each of them.

Indeed, this list is not complete; many other valuable tools can and should be examined, but it will definitely be a good starting point for your journey into the data science industry.

Scala vs. Python: machine learning libraries? I’m building a recommendation engine that will use algorithms, heuristics, and machine learning. Which one would have better existing libraries?

Machine Learning is a moving target and keeps mutating.

And Scala has nice concurrency support, so if you're doing a lot of real-time parallelized analytics, you will want to know it.
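
To illustrate what that concurrency support looks like in practice, here is a small sketch using Scala's standard Futures to run two scoring tasks in parallel; the score function is a made-up stand-in for any CPU-bound analytics step.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelScoring extends App {
  // Stand-in for an expensive, CPU-bound analytics step.
  def score(batch: Seq[Double]): Double = batch.map(x => x * x).sum

  // The two batches are scored concurrently on the global thread pool.
  val left  = Future(score(Seq(1.0, 2.0, 3.0)))
  val right = Future(score(Seq(4.0, 5.0, 6.0)))

  val combined = for {
    a <- left
    b <- right
  } yield a + b

  println(Await.result(combined, 5.seconds))
}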

However, Scala is catching up and I've noticed the quality of Scala's open source statistical learning, information theory, and AI stuff is superb.

This is a heavy-duty tool for predictive science and uncertainty, and it even includes particle filters, for which I have struggled to find robust, generic code along with documentation on how to use this particular voodoo.

Python is easy to learn, so there's nothing wrong with getting your feet wet with it and then taking your time with Scala.

Introduction to Machine Learning on Apache Spark MLlib

Speaker: Juliet Hougland, Senior Data Scientist, Cloudera

Spark MLlib is a library for performing machine learning and associated tasks on massive datasets.
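
For a taste of the MLlib API this talk covers (not the code from the talk itself), here is a minimal clustering sketch; the toy data points are invented for illustration.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch extends App {
  val spark = SparkSession.builder().appName("kmeans-sketch").master("local[*]").getOrCreate()

  // Toy 2-D points forming two loose clusters.
  val points = Seq(
    Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0), Vectors.dense(0.1, 0.3),
    Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9), Vectors.dense(8.8, 9.3)
  )
  val df = spark.createDataFrame(points.map(Tuple1.apply)).toDF("features")

  // Fit a 2-cluster k-means model and print the learned centers.
  val model = new KMeans().setK(2).setSeed(1L).fit(df)
  model.clusterCenters.foreach(println)

  spark.stop()
}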

Hello World - Machine Learning Recipes #1

Six lines of Python is all it takes to write your first machine learning program! In this episode, we'll briefly introduce what machine learning is and why it's ...

Text Classification using Spark Machine Learning

The goal of text classification is to assign text documents to a fixed number of predefined categories. Text classification has a number of applications ...
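
As a rough sketch of that workflow with Spark ML (not the exact code from the video), the pipeline below tokenizes text, hashes it into term-frequency features, and trains a logistic regression classifier; the tiny labelled corpus is invented for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TextClassificationSketch extends App {
  val spark = SparkSession.builder().appName("text-classification").master("local[*]").getOrCreate()
  import spark.implicits._

  // Tiny labelled corpus: 1.0 = Spark-related, 0.0 = not.
  val training = Seq(
    (0L, "spark streaming and mllib", 1.0),
    (1L, "cooking pasta at home", 0.0),
    (2L, "tuning spark executors", 1.0),
    (3L, "weekend hiking trip", 0.0)
  ).toDF("id", "text", "label")

  // Pipeline: split into words, hash words into a feature vector, classify.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  val model = pipeline.fit(training)
  model.transform(Seq((4L, "spark shuffle tuning")).toDF("id", "text"))
    .select("text", "prediction")
    .show()

  spark.stop()
}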

Explore the Deeplearning4j library and Scala

Romeo talks with deep learning engineer Francois Garillot about how ..

Machine Learning with Scala on Spark by Jose Quesada

This video was recorded at Scala Days Berlin 2016. Follow us on Twitter @ScalaDays or visit our website for more information. Abstract: What ..

Machine Learning with Clojure and Apache Spark - Eric Weinstein

Machine learning has become an incredibly popular field of research in the last few years. While there's no shortage of libraries and tutorials in languages like ...

The New Collections Library for Scala 2.13 and Dotty—Stefan Zeiger

Introduction to Machine Learning | MLlib | Apache Spark & Scala Tutorial

In this Apache Spark & Scala Machine Learning tutorial, the following concepts will be covered: ✓ Machine Learning Introduction ✓ Why Machine Learning ...

Jose Quesada - A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

PyData Berlin 2016. The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly.

A Machine Learning Data Pipeline - PyData SG

Using Luigi and Scikit-Learn to create a machine learning pipeline which trains a model and predicts through a REST API. Speaker: Atreya Biswas. Synopsis: A ...