AI News, BOOK REVIEW: Interactive Analytics on Dynamic Big Data in Python using Kudu, Impala, and Ibis


The new Apache Kudu (incubating) columnar storage engine together with Apache Impala (incubating) interactive SQL engine enable a new fully open source big data architecture for data that is arriving and changing very quickly.

I spent this last week expanding the Kudu Python client (a Cython wrapper for the ...)

While my Kudu patch is still in code review, I will give you a preview here of how it all works.

Since Kudu, a native C++ storage engine, now builds on OS X, I'm writing this blog using the Kudu client on OS X, so now is a great time for developers on both ...

Kudu was designed for the Hadoop ecosystem in part to simplify architectures involving very fast-arriving and fast-changing data that needs to be immediately available for analytical queries.

In the past, complex architectures were devised using the fast Parquet columnar format stored in HDFS in conjunction with HBase (for new data, but very slow for analytics), but there were numerous drawbacks that made a purpose-built column-oriented storage engine desirable.

For example, while Parquet is extremely fast for analytics, data can only be appended to a dataset and not deleted or updated.

[...] put together a cool demo showing a real-time analytics dashboard powered by Impala and Kudu.

Table columns are typed, and columns can be added to and removed from tables.

Data is stored column-oriented, and individual table columns can be read (or scanned) very fast.

Rows can be selected by indicating a number of conditions or predicates that must hold true.

Kudu does not perform analytics: its job is to manage tabular data and serve it to compute engines as fast as possible.
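To make the typed, column-oriented model concrete, here is a minimal in-memory sketch in plain Python. It is a toy, not the Kudu API: the `ColumnTable` class and the column names `user_id`, `product`, and `price` are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTable:
    """Toy column-oriented table: one Python list per typed column."""
    schema: dict                       # column name -> Python type
    columns: dict = field(init=False)  # column name -> list of values

    def __post_init__(self):
        self.columns = {name: [] for name in self.schema}

    def append(self, **row):
        # Check the row against the declared schema before storing anything.
        if set(row) != set(self.schema):
            raise ValueError("row must supply exactly the schema's columns")
        for name, value in row.items():
            if not isinstance(value, self.schema[name]):
                raise TypeError(f"{name} expects {self.schema[name].__name__}")
        for name, value in row.items():
            self.columns[name].append(value)

    def scan_column(self, name):
        # Reading one column touches only that column's list.
        return list(self.columns[name])


# Hypothetical column names; the values mirror the sample rows in this post.
purchases = ColumnTable({"user_id": int, "product": str, "price": float})
purchases.append(user_id=4, product="spam", price=2.0)
purchases.append(user_id=9, product="spam", price=3.0)
total = sum(purchases.scan_column("price"))
```

Summing `price` never reads `user_id` or `product`, which is the property that makes column scans cheap for analytics.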

To add, change, or remove data from a table, you must create a session to group the operations.

Now, you create insert operations, add them to the session, and call its flush method.
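The session pattern can be sketched in plain Python. This is a toy model under stated assumptions, not the Kudu client: the `Session` class, its `apply` signature, and the dict standing in for a table are all hypothetical. It only shows the idea that operations are buffered and take effect together on flush.

```python
class Session:
    """Toy session: buffers operations and applies them together on flush()."""

    def __init__(self, table):
        self.table = table   # dict mapping primary key -> row tuple
        self.pending = []    # operations that have not been applied yet

    def apply(self, op, key, row=None):
        # Queue the operation; the table is not modified here.
        self.pending.append((op, key, row))

    def flush(self):
        # Apply buffered operations in order, then clear the buffer.
        for op, key, row in self.pending:
            if op == "insert":
                if key in self.table:
                    raise KeyError(f"duplicate primary key {key}")
                self.table[key] = row
            elif op == "delete":
                del self.table[key]
            else:
                raise ValueError(f"unknown operation {op!r}")
        applied = len(self.pending)
        self.pending.clear()
        return applied


table = {}
session = Session(table)
for key, product, price in [(4, "spam", 2.0), (9, "spam", 3.0)]:
    session.apply("insert", key, (key, product, price))

rows_before_flush = len(table)  # still empty: nothing is applied until flush()
applied = session.flush()
```

Grouping writes this way lets a client send many operations in one round trip instead of one per row.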

... 'spam', 2.49), (4, 'spam', 2.0), (9, 'spam', 3.0)]

The table method on ic.kudu automatically creates an Impala table whose metadata references the existing data in Kudu.

You can issue SELECT, INSERT, DELETE, and UPDATE queries on data in Kudu tables via Impala, but for now only SELECT and INSERT operations are available from Ibis.

The Impala client's Kudu interface has a method create_table which enables more flexible Impala table creation with data stored in Kudu.

Note that Impala has neither the notion of primary keys nor non-nullable fields, but this metadata can inform query planning.


This design is well in line with the broader decoupling and commoditization of open source storage and compute systems that has been going on for the last 10 years.

It is the responsibility of productivity-centric programming interfaces like Ibis (which you can think of as a "UI for developers") to enhance interoperability and hide as much complexity from the user as possible.

Conclusions

Kudu is an exciting new open source storage technology which, when combined with a high-performance compute engine like Impala, enables scalable high-performance analytics on fast-changing data sets.

Having this functionality seamlessly available to Python programmers using Ibis will make it much easier to develop end-to-end applications involving big data analytics.

Most importantly, the code that you write will be largely the same whether you have 1000 or 100 billion rows of data.

I've been working to build out the Kudu Python interface so that it's easier for Python users to use the project and participate in the development community.


(Ibis is a data analysis framework incubating in Cloudera Labs that brings Apache Hadoop scale to Python development.)

In the past, complex architectures were devised using the fast Apache Parquet columnar format stored in HDFS in conjunction with Apache HBase (for new data, but very slow for analytics), but there were numerous drawbacks that made a purpose-built column-oriented storage engine desirable.

I’ve installed the Kudu Python client and now import it and connect to the Kudu master in the VM. Since this is a brand new cluster, there are no tables created yet. To create one, we first create a schema and then create the table. Now, we can get a handle for this new table and see its schema.

Now, let’s insert some data. To add, change, or remove data from a table, you must create a session to group the operations. Now, you create insert operations, add them to the session, and call its flush method.

Now, suppose we wanted to select some data from the table.

To do this, we create a scanner for the table in question. To read all of the data out, you open the scanner and call one of its read methods. To only read a particular subset of data, you add predicates to the scanner. That’s all we need to know for now.
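The scanner workflow described above (create a scanner, optionally add predicates, then read) can be illustrated with a small stand-in written in plain Python. This is only a sketch of the idea: the `Scanner` class, its method names, the column names, and the `eggs` row are assumptions for illustration, not the Kudu client's interface.

```python
import operator


class Scanner:
    """Toy scanner: collects predicates, then filters rows when read."""

    def __init__(self, column_names, rows):
        self.column_names = column_names
        self.rows = rows
        self.predicates = []

    def add_predicate(self, column, op, value):
        # Store the predicate as (column index, comparison function, value).
        index = self.column_names.index(column)
        self.predicates.append((index, op, value))
        return self

    def read_all(self):
        # A row is returned only if every predicate holds for it.
        return [
            row for row in self.rows
            if all(op(row[i], value) for i, op, value in self.predicates)
        ]


rows = [(1, "eggs", 1.5), (4, "spam", 2.0), (9, "spam", 3.0)]
scanner = Scanner(["user_id", "product", "price"], rows)

everything = scanner.read_all()  # no predicates yet: all rows come back

scanner.add_predicate("product", operator.eq, "spam")
scanner.add_predicate("price", operator.ge, 2.5)
filtered = scanner.read_all()    # rows where product == 'spam' and price >= 2.5
```

In a real storage engine the predicates are evaluated on the server side, so only matching rows travel over the network; the toy version filters in the same process.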

Let’s take a look. This Impala cluster is built with Kudu support, so I can connect my Ibis client to the Kudu master. Now, let’s see about that data we just wrote. The table method on ic.kudu automatically creates an Impala table whose metadata references the existing data in Kudu. The result behaves just like any other Ibis table, such as those you might have used with HDFS or SQLite. You can issue SELECT, INSERT, DELETE, and UPDATE queries on data in Kudu tables via Impala, but for now only SELECT and INSERT operations are available from Ibis.

Using Impala to Query Kudu Tables

The primary key for a Kudu table is a column, or set of columns, that uniquely identifies every row.

You can specify the PRIMARY KEY attribute either inline in a single column definition, or as a separate clause at the end of the column list. When the primary key is a single column, these two forms are equivalent.

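If the primary key consists of more than one column, you must specify it as a separate entry in the column list. The SHOW CREATE TABLE statement always represents the PRIMARY KEY specification as a separate item in the column list. The notion of primary key only applies to Kudu tables.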
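The two PRIMARY KEY forms can be compared side by side by generating the statements from Python. This is a hedged sketch: the helper `create_table_sql`, the table and column names, and the `PARTITION BY HASH ... PARTITIONS 2` clause are illustrative assumptions, and the strings are only built here, not sent to Impala.

```python
def create_table_sql(name, columns, primary_key, inline=False):
    """Build an Impala CREATE TABLE statement for a Kudu table (sketch)."""
    if inline and len(primary_key) != 1:
        raise ValueError("the inline form only works for a single-column key")
    parts = []
    for col, col_type in columns:
        # Inline form: attach PRIMARY KEY to the key column's own definition.
        suffix = " PRIMARY KEY" if inline and col == primary_key[0] else ""
        parts.append(f"{col} {col_type}{suffix}")
    if not inline:
        # Separate form: one PRIMARY KEY clause at the end of the column list.
        parts.append(f"PRIMARY KEY ({', '.join(primary_key)})")
    return (
        f"CREATE TABLE {name} ({', '.join(parts)}) "
        f"PARTITION BY HASH ({', '.join(primary_key)}) PARTITIONS 2 "
        "STORED AS KUDU"
    )


columns = [("id", "BIGINT"), ("name", "STRING")]
inline_sql = create_table_sql("users", columns, ["id"], inline=True)
separate_sql = create_table_sql("users", columns, ["id"])

# A composite key has to use the separate clause.
events = [("host", "STRING"), ("ts", "BIGINT"), ("value", "DOUBLE")]
composite_sql = create_table_sql("events", events, ["host", "ts"])
```

Asking the helper for the inline form with a two-column key raises `ValueError`, mirroring the rule that a multi-column key needs the separate clause.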

If an existing row has an incorrect or outdated key column value, delete the old row and insert an entirely new row with the correct primary key.
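The delete-then-insert approach to correcting a key can be sketched with a dict standing in for the table. The `rekey` helper and the row layout are hypothetical and exist only to show the order of operations; this is not Impala or Kudu code.

```python
def rekey(table, old_key, new_key):
    """Move a row to a corrected primary key via delete followed by insert."""
    if new_key in table:
        raise KeyError(f"primary key {new_key} already exists")
    row = table.pop(old_key)                  # delete the row stored under the old key
    table[new_key] = {**row, "id": new_key}   # insert a new row with the corrected key


table = {4: {"id": 4, "product": "spam", "price": 2.0}}
rekey(table, old_key=4, new_key=5)
```

Checking for the new key first avoids deleting the old row when the insert could not succeed.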