AI News, Ibis on Impala: Python at Scale for Data Science

Ibis on Impala: Python at Scale for Data Science

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale.

For business analysts in particular, who are the rank-and-file of big data consumers, the Hadoop experience is becoming all but indistinguishable from that of traditional data infrastructure but with unprecedented scale, flexibility, and cost-effectiveness under the covers.

While SQL and BI remain the core capability of most analytics environments, advanced statistical analysis—now often known as data science—is becoming an increasingly popular tool in the expanding data analysis toolbox.

And in the data science world, Python has emerged as the most popular choice for expressing such complex workflows—as well as for its value in programmatic data preparation (via the Python pandas framework)—because of its power, elegance, and robust libraries and third-party integrations.

Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately leading to a loss of productivity and addressable use cases.

Co-founded by the respective architects of the Python pandas toolkit and Impala and now incubating in Cloudera Labs, Ibis is a new data analysis framework with the goal of enabling advanced data analysis on a 100% Python stack with full-fidelity data.

In this initial (unsupported) Cloudera Labs release, Ibis offers comprehensive support for the analytical capabilities presently provided by Impala, enabling Python users to run Big Data workloads in a manner similar to that of “small data” tools like pandas.

Python and Apache Hadoop: A State of the Union

Over the last five years, the rapid growth of Python’s open source data tools have made it a tool of choice for a wide variety of data engineering and data science needs.

During the same time period, the Apache Hadoop ecosystem has risen to the challenge of collecting, storing, and analyzing accelerating volumes of data with the robustness, security, and scalable performance demanded by the world’s largest enterprises.

In this post, we’ll discuss the options currently available to Python users and concrete steps we are taking to better enable a first class Python-on-Hadoop experience.

For big data problems, a common practice is to sample or reduce a dataset to a sufficiently small size, then analyze the result with local tools like Pandas and scikit-learn.

There are several ways to scale Python across a cluster today, including: The original example of Python-on-Hadoop is through Hadoop Streaming, a flexible interface for writing MapReduce jobs in any language capable of sending and receiving data through UNIX pipes, one line at a time.

In addition, developers must pay substantial data serialization costs that undermine the performance benefits of compiled extensions in C, C++, and FORTRAN found in libraries like NumPy, SciPy, Pandas, and scikit-learn.

The current implementation suffers from several challenges, including: Additionally, in order to ensure that Python code developed on the desktop functions also works within the cluster, system administrators have struggled to make popular Python libraries available across Spark worker nodes.

Ibis, which began last year within Cloudera Labs, has targeted Impala for SQL to deliver speedy interactive analytics while simplifying data wrangling tasks involving the Hive metastore and storage systems like HDFS or Apache Kudu (incubating), a new distributed storage engine designed for fast analytics on fast-changing data.

This eases the pain of developers who need the latest and greatest Python libraries to be installed across a cluster, and of the system administrators who must install and manage those libraries.

Getting Started with Ibis and How to Contribute

We created Ibis, a new Python data analysis framework now incubating in Cloudera Labs, with the goal of enabling data scientists and data engineers to be as productive working with big data as they are working with small and medium data today.

Having spent much of the last decade improving the usability of the single-node Python experience (with pandas and other projects), we are looking to achieve: (Read more about the technical vision for Ibis in this post.) The Ibis user interface centers around a general pandas-like data expression API.

Users compose relational algebra, data transformations, and analytics on data in HDFS, and these operations are executed transparently, returning results in the familiar pandas data frame format.

We have focused Ibis development on integration with Impala due to architectural synergies (namely, Impala’s C++ and LLVM-based engine) that will enable Python for the first time to achieve native hardware performance at Hadoop scale and integrate with the existing Python data analysis and high-performance computing ecosystem.

Cloudera Opens Up New Capabilities, Making Hadoop Even More Usable by and Accessible to Data Scientists

ALTO, Calif., – July 20, 2015 – As the amount of data continues to grow exponentially, data scientists increasingly need the ability to perform full-fidelity analysis of that data at massive scale. Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced a number of new initiatives to enable data scientists to take advantage of big data and Hadoop for data analysis with more complex workflows. Beginning

has evolved dramatically over the last decade, from a batch processing tool to an entire ecosystem that powers most of today’s information architecture as well as traditional BI workloads,” said Wes McKinney, a software engineer at Cloudera and the creator of Python pandas.

Cloudera recognized the importance of the Python language in modern data engineering and data science and how, thanks to its use of more complex workflows, it has become a primary language for data transformation and interactive analysis.

Conference In light of Hadoop’s wide ranging flexibility and practicality, and as data scientists can now leverage its power to solve some of today’s most pressing problems, Cloudera has announced Wrangle, a single-day, single-track industry event that will dive into the principles, practice, and application of data science from the startup to the enterprise.

SF Big Analytics: Next-generation Python Big Data Tooling, powered by Apache Arrow

Python's data tool ecosystem developed out of a long legacy of scientific and numerical computing software that developed from the late-1990s to the mid-2000s ...

McKinney SF DM talk

full article: We are excited to have Wes McKinney give a demo and discuss the roadmap of Ibis, a new data analytics framework. While Python is a de-facto ...