
Introducing DataFrames in Apache Spark for Large Scale Data Science

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

This API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications.

As an extension to the existing RDD API, DataFrames feature:

- The ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
- Seamless integration with all big data tooling and infrastructure via Spark
- APIs for Python, Java, and Scala

For users familiar with data frames in other programming languages, this API should make them feel at home.

DataFrames can be built from existing tables or loaded from external data sources. For example, in Python:

users = context.table("users")
logs = context.load("s3n://path/to/data.json", "json")

Once built, DataFrames provide a domain-specific language for distributed data manipulation.

Here is an example of using DataFrames to manipulate the demographic data of a large population of users:

# Create a new DataFrame that contains young users only
young = users.filter(users.age < 21)
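In the same style, a sketch of further DSL operations on the young DataFrame (the gender column below is an assumption for illustration):

# Increment everybody's age by one, keeping name and the new age
young.select(young.name, young.age + 1)

# Count the number of young users by gender
young.groupBy("gender").count()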

This allows their execution to be optimized by applying techniques such as predicate pushdown and bytecode generation, as explained later in the section "Under the Hood: Intelligent Optimization and Code Generation".
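As a minimal illustration of this, DataFrame operations are recorded as a plan rather than executed eagerly, and the plan the optimizer chooses can be inspected before anything runs (a sketch reusing the users DataFrame from above):

# Nothing is computed yet; the filter is only recorded as a logical plan
young = users.filter(users.age < 21)

# Print the optimized physical plan chosen for this query
young.explain()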

DataFrames can be read from local file systems, distributed file systems (HDFS), cloud storage (S3), and external relational database systems via JDBC.
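A sketch of reading from each kind of system with the Spark 1.3-era Python API (all paths and the JDBC URL below are placeholders):

# Distributed file system (HDFS)
events = context.load("hdfs://namenode/data/events.parquet", "parquet")

# Cloud storage (S3)
logs = context.load("s3n://bucket/logs.json", "json")

# External relational database via JDBC
users = context.jdbc("jdbc:postgresql:production", "users")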

This support for diverse data sources enables applications to easily combine data from disparate sources (known as federated query processing in database systems).

For example, the following code snippet joins a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site.
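A sketch of that snippet (the JDBC URL, log path, and column names are illustrative):

# Users table from PostgreSQL via JDBC
users = context.jdbc("jdbc:postgresql:production", "users")

# Textual traffic log from S3
logs = context.load("s3n://path/to/traffic.log")

# Count the number of visits per user
logs.join(users, logs.userId == users.userId, "left_outer") \
    .groupBy("userId").agg({"*": "count"})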

For example, the following code creates a simple text classification pipeline consisting of a tokenizer, a hashing term frequency feature extractor, and logistic regression.
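A sketch of such a pipeline using the pyspark.ml API (the column names and hyperparameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

# Each stage reads one column of the DataFrame and appends another
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])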

Once the pipeline is set up, we can use it to train on a DataFrame directly. For more complicated tasks beyond what the machine learning pipeline API provides, applications can also apply arbitrarily complex functions to a DataFrame, which can be manipulated using Spark's existing RDD API as well.
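For example, assuming a DataFrame df with "text" and "label" columns (names carried over from the pipeline sketch above):

# Train the whole pipeline in one call; returns a fitted PipelineModel
model = pipeline.fit(df)

# Drop down to the RDD API for arbitrary per-row logic
lengths = df.rdd.map(lambda row: len(row.text))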

In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding.
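A minimal sketch of a query that can benefit from these Parquet optimizations (the path and column name are assumptions):

# The equality filter is pushed into the Parquet reader, which can skip
# row groups whose column statistics rule out "US" and can compare
# dictionary-encoded integer codes instead of full strings
events = context.load("/path/to/events.parquet", "parquet")
events.filter(events.country == "US").count()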

[Chart: group-by-aggregation runtime on 10 million integer pairs, DataFrame vs. RDD, in Python and Scala]

The chart compares the runtime performance of running group-by-aggregation on 10 million integer pairs on a single machine (source code).

Since both Scala and Python DataFrame operations are compiled into JVM bytecode for execution, there is little performance difference between the two languages, and both outperform the vanilla Python RDD variant by a factor of 5 and the Scala RDD variant by a factor of 2.
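A sketch of the two styles being compared, for a pair RDD and an equivalent DataFrame with columns "a" and "b" (names assumed):

# RDD version: opaque lambdas that Spark cannot inspect or optimize
rdd.reduceByKey(lambda x, y: x + y)

# DataFrame version: a declarative plan that Catalyst optimizes and
# compiles to JVM bytecode
df.groupBy("a").agg({"b": "sum"})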

Related videos and talks

RDDs, DataFrames and Datasets in Apache Spark (NE Scala 2016)
Traditionally, Apache Spark jobs have been written using Resilient Distributed Datasets (RDDs), a Scala Collections-like API. RDDs are type-safe, but they can ...

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets (Jules Damji)
"Of all the developers' delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive ..."

Peter Hoffmann: Introduction to the PySpark DataFrame API
Apache Spark is a computational engine for large-scale data processing. It is responsible for scheduling, distributing, and monitoring applications that consist ...

Apache Spark - Scala - DataFrame APIs

Apache Spark RDD Basics: What is an RDD, and how to create an RDD
Learning objectives: in this module you will learn what an RDD is and two ways to create one. The video also shows how to create an RDD ...

Apache Spark - Scala - DataFrame Application using Scala IDE

Spark DataFrames: Simple and Fast Analysis of Structured Data

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets and Streaming (Michael Armbrust)
As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization.

Spark Tutorial: Introduction to DataFrames
Source code and the transcript are available on my website.

sparkSQL: RDD to DataFrame Conversion
This video gives a clear idea of how to preprocess unstructured data using RDD operations and then convert it into a DataFrame.