Top 15 Scala Libraries for Data Science in 2018

Scala has gained popularity mostly due to the rise of Spark, the big data processing engine of choice, which is written in Scala and thus provides a native Scala API.

Currently, Python and R remain the leading languages for rapid data analysis and for building, exploring, and manipulating powerful models, while Scala is becoming the key language for developing production-grade products that work with big data, since such systems demand stability, flexibility, high speed, and scalability.

For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.

In fact, just one top-level comprehensive tool forms the basis for developing data science and big data solutions in Scala: Apache Spark. It is supplemented by a wide range of libraries and tools written in both Scala and Java.

Breeze provides fast and efficient manipulation of data arrays and enables many other numerical operations, including linear algebra, optimization, and probability distributions. Breeze also offers plotting capabilities, which we will discuss below.
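
For illustration, here is a minimal Breeze sketch (the values are made up; assumes the breeze artifact is on the classpath):

```scala
// A minimal Breeze sketch: basic vector/matrix operations and summary statistics.
import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.stats.mean

object BreezeExample extends App {
  val v = DenseVector(1.0, 2.0, 3.0)   // a dense numeric vector
  val m = DenseMatrix(
    (1.0, 2.0, 3.0),
    (4.0, 5.0, 6.0))                   // a 2x3 matrix

  val scaled  = v * 2.0                // elementwise scaling
  val product = m * v                  // matrix-vector product: DenseVector(14.0, 32.0)
  println(s"sum = ${sum(v)}, mean = ${mean(v)}, product = $product")
}
```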

These libraries are mostly used as text parsers, with Puck being the more convenient choice when you need to parse thousands of sentences, thanks to its high speed and GPU usage.

Vegas provides declarative visualization that allows you to focus mainly on specifying what needs to be done with the data and conducting further analysis of the visualizations, without having to worry about the code implementation.
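
To give a flavor of the declarative style, here is a sketch along the lines of the project's README example (the data are made up):

```scala
// A minimal Vegas sketch: a declarative bar chart specification.
import vegas._

object VegasExample extends App {
  Vegas("Country Population").
    withData(Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "Denmark", "population" -> 5)
    )).
    encodeX("country", Nom).       // nominal (categorical) x axis
    encodeY("population", Quant).  // quantitative y axis
    mark(Bar).
    show                           // render the plot (e.g., in a notebook)
}
```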

The library impresses with its speed, broad applicability, efficient memory usage, and a large set of machine learning algorithms for classification, regression, nearest neighbor search, feature selection, and more.
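
This description matches Smile (Statistical Machine Intelligence and Learning Engine); assuming that is the library meant here, a hedged sketch of its Scala API fitting a k-nearest-neighbor classifier on made-up toy data:

```scala
// A hedged sketch of Smile's Scala API: fitting a k-NN classifier on toy data.
import smile.classification.knn

object SmileExample extends App {
  // Two well-separated clusters of 2-D points with binary labels
  val x = Array(
    Array(0.0, 0.0), Array(0.1, 0.2), Array(0.2, 0.1),
    Array(5.0, 5.1), Array(4.9, 5.3), Array(5.2, 4.8))
  val y = Array(0, 0, 0, 1, 1, 1)

  val model = knn(x, y, 3)                 // fit a 3-nearest-neighbor classifier
  println(model.predict(Array(0.1, 0.1))) // expected: 0
  println(model.predict(Array(5.0, 5.0))) // expected: 1
}
```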

It utilizes mathematical formulas to create complex dynamic neural networks through a combination of object-oriented and functional programming.

Summingbird is a domain-specific data processing framework that allows integration of batch and online MapReduce computations, as well as a hybrid batch/online processing mode.

The main catalyst for designing the framework came from Twitter developers who often found themselves writing the same code twice: first for batch processing, then once more for online processing.

Summingbird consumes and generates two types of data: streams (infinite sequences of tuples) and snapshots, regarded as the complete state of a dataset at some point in time.
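
The project's canonical word count illustrates the write-once idea: the job is defined against an abstract platform type and can then be bound to a batch platform (Scalding) or a streaming platform (Storm). A sketch, roughly as in the Summingbird README:

```scala
// A platform-generic Summingbird word count, roughly as in the project README.
import com.twitter.summingbird.{Platform, Producer}

object SummingbirdWordCount {
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").toSeq

  // Written once against the abstract Platform type: binding it to Scalding
  // yields a batch job, binding it to Storm yields a streaming job.
  def wordCount[P <: Platform[P]](
      source: Producer[P, String],
      store: P#Store[String, Long]) =
    source
      .flatMap { sentence => toWords(sentence).map(_ -> 1L) }
      .sumByKey(store)
}
```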

It enables you to easily and efficiently build, evaluate and deploy engines, implement your own machine learning models, and incorporate them into your engine.

The main difference, also considered the most significant improvement, is the additional layer between the actors and the underlying system, which requires the actors only to process messages while the framework handles all other complexity.

All actors are arranged hierarchically, forming an Actor System that helps actors interact with each other more efficiently and solve complex problems by dividing them into smaller tasks.
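
A minimal sketch with classic Akka actors shows this division of labor: the actor only declares how it reacts to messages, while the framework supplies the mailbox, scheduling, and supervision (assumes the akka-actor artifact):

```scala
// A minimal classic Akka actor: define message handling, send a message.
import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  // The actor only has to define how it reacts to incoming messages;
  // concurrency, mailboxes, and supervision are handled by the framework.
  def receive: Receive = {
    case name: String => println(s"Hello, $name!")
  }
}

object ActorExample extends App {
  val system  = ActorSystem("greeting-system")          // root of the actor hierarchy
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! "Scala"                                     // fire-and-forget message send
  Thread.sleep(500)
  system.terminate()
}
```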

It ensures asynchronous, non-blocking, actor-based, high-performance request processing, while the internal Scala DSL provides a way of defining web service behavior, along with efficient and convenient testing capabilities.
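
This description matches Spray, which has since been superseded by Akka HTTP. As a hedged sketch, a minimal service in the Akka HTTP routing DSL, which inherited Spray's approach, might look like this:

```scala
// A minimal HTTP service sketch in the Akka HTTP routing DSL
// (assumes akka-actor, akka-stream, and akka-http on the classpath).
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object HelloService extends App {
  implicit val system: ActorSystem = ActorSystem("hello-service")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // The routing DSL declares behavior: GET /hello -> 200 OK with a greeting
  val route = path("hello") {
    get {
      complete("Hello from an asynchronous, non-blocking service!")
    }
  }

  Http().bindAndHandle(route, "localhost", 8080)
}
```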

If you have some positive experience with any other useful Scala libraries or frameworks that are worth adding to this list, please feel free to share them in the comment section below.

Apache Spark

Apache Spark has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.[2] In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged[3] even though the RDD API is not deprecated.[4][5] The RDD technology still underlies the Dataset API.[6]

Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk.
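
To make the RDD dataflow concrete, here is a minimal word count against the RDD API (assumes spark-core; input.txt is a hypothetical local file):

```scala
// A minimal RDD word count: read, map, and reduce a distributed collection.
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("wordcount").setMaster("local[*]"))

  val counts = sc.textFile("input.txt")   // read lines into an RDD
    .flatMap(_.split("\\s+"))             // map: split lines into words
    .map(word => (word, 1))               // pair each word with a count of 1
    .reduceByKey(_ + _)                   // reduce: sum counts per word

  counts.take(10).foreach(println)
  sc.stop()
}
```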

Other streaming data engines that process event by event rather than in mini-batches include Storm and the streaming component of Flink.[18] Spark Streaming has support built in to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.[19] In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming.[20]

Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit.[21] Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines.
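
As a brief illustration of MLlib's DataFrame-based API (assumes spark-mllib; the toy data are made up), here is how fitting a logistic regression looks:

```scala
// A minimal MLlib sketch: fit and apply a logistic regression on toy data.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibExample extends App {
  val spark = SparkSession.builder
    .appName("mllib-demo").master("local[*]").getOrCreate()

  // Toy labeled points: (label, features)
  val training = spark.createDataFrame(Seq(
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.1, 1.2)),
    (1.0, Vectors.dense(2.2, 0.9))
  )).toDF("label", "features")

  val model = new LogisticRegression().setMaxIter(10).fit(training)
  model.transform(training).select("features", "prediction").show()
  spark.stop()
}
```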

GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable, and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.[23] GraphX provides two separate APIs for implementation of massively parallel algorithms (such as PageRank): a Pregel abstraction, and a more general MapReduce-style API.[24] Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices).[25] GraphX can be viewed as the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.[26]

Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.[27] Spark itself was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.
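
To make the property-graph model concrete, here is a minimal GraphX sketch (assumes spark-graphx) that builds a small graph and runs PageRank:

```scala
// A minimal GraphX sketch: build a small property graph and run PageRank.
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphXPageRank extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

  // Vertices and edges both carry properties (names and weights here)
  val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
  val graph    = Graph(vertices, edges)

  // Run PageRank until convergence tolerance 0.001
  graph.pageRank(0.001).vertices.collect().foreach(println)
  sc.stop()
}
```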

Text Classification using Spark Machine Learning

The goal of text classification is the classification of text documents into a fixed number of predefined categories. Text classification has a number of applications ranging from email spam...

Machine Learning with Scala on Spark by Jose Quesada

This video was recorded at Scala Days Berlin 2016. Follow us on Twitter @ScalaDays or visit our website for more information. Abstract: What new superpowers does it give..

Explore the Deeplearning4j library and Scala

Romeo talks with deep learning engineer Francois Garillot about how DeepLearning4J, a Java™ library..

Introduction to Machine Learning on Apache Spark MLlib

Speaker: Juliet Hougland, Senior Data Scientist, Cloudera. Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning...

MMLSpark: Lessons from Building a SparkML Compatible Machine Learning Library - Miruna Oprescu

"With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data...

Scala and Machine Learning with Andrew McCallum

In this video from the Northeast Scala Symposium, Andrew McCallum, Professor of Computer Science at the University of Massachusetts Amherst, is going to discuss trends in machine learning using Scala...

Marius van Niekerk | Integrating Scala/Java with your Python code

PyData Carolinas 2016. Occasionally, Python-focused data shops need to use JVM languages for performance reasons. Generally this necessitates throwing away whole repositories of Python code...

A Machine Learning Data Pipeline - PyData SG

Using Luigi and Scikit-Learn to create a machine learning pipeline that trains a model and predicts through a REST API. Speaker: Atreya Biswas. Synopsis: A Machine Learning Pipeline can be...

Kafka’s Streams API for Highly Scalable Machine Learning & Deep Learning in Real Time by Kai Waehner

Intelligent real-time applications are a game changer in any industry. This session explains how companies from different industries build intelligent real-time applications. The first part...

I Love Scala For Machine Learning