AI News, Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline (Examples from Stripe, Tapad, Etsy & Square)

Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline (Examples from Stripe, Tapad, Etsy & Square)

In response to a recent post from MongoHQ entitled “You don’t have big data,” I should say up front that I generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data, we can all agree that *more* data is here to stay - and that’s due to many different factors.

One of the most important things I’ve learned over the past couple of years is that it’s crucial for forward-thinking companies to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data.

The main reason for this is to be able to tee up the data in a consistent way for the seemingly magical, quant-like operations that infer relationships in the data that would otherwise have gone unnoticed - ingeniously described in the referenced article as “determining the nature of needles from a needle-stack.” This raises the question: what are the characteristics of a well-designed data pipeline?

“Most of our data stores are updated in real time by processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica, and the raw events, often enriched with data from our Aerospike cluster, are stored in HDFS) -

We strive to write our computation logic so that it can be run in-stream *and* in batch MR mode without any modification.” He notes that the last point allows them to retroactively change their streaming computation at will and then backfill the other data stores with updated projections.
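Tapad's actual implementation isn't shown in this excerpt (their quote mentions MR), but the idea of sharing one piece of computation logic between the streaming and batch paths can be sketched with Spark, used here purely for illustration; the column names and paths below are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Keep the transformation a pure function over DataFrames so the exact same
// logic can be applied to a batch backfill and to the live stream.
def enrichEvents(events: DataFrame): DataFrame =
  events
    .filter(col("eventType").isNotNull)            // assumed column
    .withColumn("day", to_date(col("timestamp")))  // assumed column

val spark = SparkSession.builder.appName("shared-logic").getOrCreate()

// Batch mode: backfill over events already landed in HDFS (path assumed).
val landed      = spark.read.parquet("/data/events")
val batchResult = enrichEvents(landed)

// Streaming mode: the same function, unchanged, over new files as they arrive.
val streamResult = enrichEvents(
  spark.readStream.schema(landed.schema).parquet("/data/events"))
```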

Dag also explains the “why” behind their use of multiple types of data technologies on the storage side: each of them has its own particular “sweet spot” which makes it attractive to them:

   · Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.

   · Aerospike: Extremely fast random-access read/write performance by key (we have 3.2 billion keys and 4TB of replicated data), cross-data-center replication, high availability, but very limited querying capabilities.

For every payment received there are “roughly 10 to 15 accounting entries required, and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.” Square’s approach to solving this problem leverages stream processing, which allows them to map each data domain to a corresponding stream.

“One example operator is a ‘matcher’ which takes two streams, extracts similarly kinded keys from them, and produces two streams separated based on matching criteria.” Pascal notes that the system of stream processing and stream-based operators is similar to relational algebra and its operators, but in this case operating in real time on infinite relations.
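Square's matcher itself is proprietary, but the shape of the idea can be shown with a small, purely hypothetical sketch: take two keyed inputs (stand-ins for the two streams) and split the records into a "matched" output and an "unmatched" output.

```scala
case class Entry(key: String, payload: String)

// Split the combined input into a "matched" output (keys present on both
// sides) and an "unmatched" output (keys seen on only one side).
def matcher(left: Seq[Entry], right: Seq[Entry]): (Seq[Entry], Seq[Entry]) = {
  val sharedKeys = left.map(_.key).toSet intersect right.map(_.key).toSet
  (left ++ right).partition(e => sharedKeys.contains(e.key))
}

// Example: reconcile payments against ledger entries by a shared key.
val (matched, unmatched) = matcher(
  Seq(Entry("p1", "payment"), Entry("p3", "payment")),
  Seq(Entry("p1", "ledger"),  Entry("p2", "ledger")))
// matched   -> p1 payment, p1 ledger
// unmatched -> p3 payment, p2 ledger
```

A real implementation would run over unbounded streams with state and windowing rather than in-memory sequences, but the matched/unmatched split is the core of the operator described above.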

Streaming Data Pipeline to Transform, Store and Explore Healthcare Dataset With Apache Kafka API, Apache Spark, Apache Drill, JSON and MapR-DB

In the past, big data was processed in batch, typically once a day.

Data pipelines, which combine real-time stream processing with the collection, analysis and storage of large amounts of data, enable modern real-time applications, analytics and reporting.

This post is based on a recent workshop I helped develop and deliver at a large health services and innovation company's analytics conference.

This company is combining streaming data pipelines with data science on top of the MapR Converged Data Platform to improve healthcare outcomes, improve access to appropriate care, better manage cost, and reduce fraud, waste and abuse.

Open Payments, established in 2013, is a federal program that collects information about the payments drug and device companies make to physicians and teaching hospitals for things like travel, research, gifts, speaking fees, and meals.

A common data pipeline architecture pattern is event sourcing, using an append-only publish-subscribe event stream such as MapR Event Streams (which provides a Kafka API).

MapR-ES Topics are logical collections of events, which organize events into categories and decouple producers from consumers, making it easy to add new producers and consumers.

MapR-ES can scale to very high throughput levels, easily delivering millions of messages per second using very modest hardware.
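Because MapR-ES exposes the Kafka API, producing events looks like ordinary Kafka producer code. The sketch below is illustrative only: the topic name and sample record are made up, and the bootstrap.servers setting is a placeholder that applies to plain Apache Kafka (MapR-ES resolves "/stream-path:topic" style names through its own client libraries).

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder for plain Kafka
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Hypothetical topic; with MapR-ES it might look like "/apps/payments:raw".
val topic   = "open-payments-raw"
val csvLine = "covered_recipient_physician,372064,NEWTON,150.00" // sample row
producer.send(new ProducerRecord[String, String](topic, csvLine))
producer.close()
```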

Stream processing of events is useful for filtering, transforming, creating counters and aggregations, correlating values, joining streams together, machine learning, and publishing to a different topic for pipelines.
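As an illustration of that kind of in-stream transformation, here is a hedged Spark Structured Streaming sketch that reads raw CSV events from a topic, parses a few assumed columns, and republishes the records as JSON to another topic. Topic names, servers and fields are placeholders, not the workshop's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("payments-etl").getOrCreate()
import spark.implicits._

// Read raw CSV lines from the ingest topic (names are placeholders).
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "open-payments-raw")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Split each CSV line into assumed columns.
val parsed = raw
  .withColumn("fields", split($"line", ","))
  .select(
    $"fields".getItem(0).as("recipient_type"),
    $"fields".getItem(1).as("physician_id"),
    $"fields".getItem(2).as("city"),
    $"fields".getItem(3).cast("double").as("amount"))

// Re-emit the parsed records as JSON to a downstream topic.
parsed
  .selectExpr("physician_id AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "open-payments-json")
  .option("checkpointLocation", "/tmp/checkpoints/payments-etl")
  .start()
```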

With MapR-DB (HBase API or JSON API), a table is automatically partitioned into tablets across a cluster by key range, providing for scalable and fast reads and writes by row key.

The Spark MapR-DB Connector enables users to perform complex SQL queries and updates on top of MapR-DB using a Spark Dataset, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

Dataset is a newer interface that provides the benefits of strong typing, the ability to use powerful lambda functions, and efficient object serialization/deserialization, combined with the benefits of Spark SQL's optimized execution engine.
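For example (a generic Spark sketch, with a made-up case class rather than the workshop's Open Payments schema), converting data to a typed Dataset gives compile-time checked fields and plain Scala lambdas while Spark SQL still optimizes the execution plan:

```scala
import org.apache.spark.sql.SparkSession

case class Payment(physicianId: String, payer: String, amount: Double)

val spark = SparkSession.builder.appName("dataset-example").getOrCreate()
import spark.implicits._

// A typed Dataset built from sample rows (invented for illustration).
val payments = Seq(
  Payment("372064", "Acme Pharma", 150.0),
  Payment("889911", "MedDevice Co", 25.5)).toDS()

val large = payments
  .filter(p => p.amount > 100.0)   // lambda over a typed object
  .map(p => (p.payer, p.amount))   // strongly typed transformation

large.show()
```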

Drill provides a massively parallel processing execution engine, built to perform distributed query processing across the various nodes in a cluster.

Summary: In this blog post, you've learned how to consume streaming Open Payments CSV data, transform it to JSON, store it in a document database, and explore it with SQL using Apache Spark, MapR-ES, MapR-DB, OJAI, and Apache Drill.

Code: All of the components of the use case architecture we just discussed can run on the same cluster with the MapR Converged Data Platform.

This example was developed using the MapR 6.0 container for developers, a Docker container that enables you to create a single-node MapR cluster.

Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1

However, building production-grade continuous applications can be challenging, as developers need to overcome many obstacles. Structured Streaming in Apache Spark builds upon the strong foundation of Spark SQL, leveraging its powerful APIs to provide a seamless query interface, while simultaneously optimizing its execution engine to enable low-latency, continually updated answers.

This blog post kicks off a series in which we will explore how we are using the new features of Apache Spark 2.1 to overcome the above challenges and build our own production pipelines.

Using this pipeline, we have converted 3.8 million JSON files containing 7.9 billion records into a Parquet table, which allows us to run ad-hoc queries on an updated-to-the-minute Parquet table 10x faster than on the raw JSON files.

For example, one approach is to dump the raw data in real time and then convert it to structured form every few hours to enable efficient queries.

Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.

In this case, we will transform the raw JSON data such that it’s easier to query using Spark SQL’s built-in support for manipulating complex nested schemas.

We also parse the event time string in each record into Spark’s timestamp type, and flatten out the nested columns for easier querying.
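A hedged sketch of that kind of transformation follows; the schema, column names and paths here are invented for illustration and are not the post's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("json-transform").getOrCreate()
import spark.implicits._

// Invented schema: a nested "device" struct plus a string event time.
val schema = new StructType()
  .add("eventTimeString", StringType)
  .add("device", new StructType()
    .add("id", StringType)
    .add("zipCode", StringType))
  .add("status", StringType)

val rawJson = spark.readStream.schema(schema).json("/data/raw-json")

val flattened = rawJson
  // Parse the string event time into a proper timestamp column.
  .withColumn("eventTime",
    unix_timestamp($"eventTimeString", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
  // Flatten the nested struct into top-level columns for easier querying.
  .select(
    $"eventTime",
    $"device.id".as("deviceId"),
    $"device.zipCode".as("zipCode"),
    $"status")
```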

When it finds new data (i.e., new rows in the Input Table), it transforms the data to generate new rows in the Result Table, which then get written out as Parquet files.

The streaming query writes the Parquet data transactionally such that concurrent interactive query processing will always see a consistent view of the latest data.

This checkpoint directory is per query, and while a query is active, Spark continuously writes metadata of the processed data to the checkpoint directory.

More specifically, on the new cluster, Spark uses the metadata to start the new query where the failed one left off, thus ensuring end-to-end exactly-once guarantees and data consistency (see Fault Recovery section of our previous blog).
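Continuing the sketch above (paths are placeholders), writing the transformed stream out as an incrementally updated Parquet table with a per-query checkpoint directory looks roughly like this:

```scala
// The checkpoint directory stores the metadata Spark needs to restart the
// query exactly where it left off after a failure.
val query = flattened.writeStream
  .format("parquet")
  .option("path", "/data/parquet/events")                   // output table (placeholder)
  .option("checkpointLocation", "/data/checkpoints/events") // per-query checkpoint
  .start()
```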

It is often complex to set up such a pipeline using most existing systems, as you would have to set up multiple processes: a batch job to convert the historical data, a streaming pipeline to convert the live data, and maybe another step to combine the results.

You can configure the above query to prioritize processing new data files as they arrive, while using spare cluster capacity to process the old files.

This tunes the query to update the downstream data warehouse more frequently, so that the latest data is made available for querying as soon as possible.

Together, we can define the rawLogs DataFrame in a way that lets us write a single query that easily combines live data with historical data, while ensuring low latency, efficiency and data consistency.
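The post's actual rawLogs definition isn't reproduced in this digest; as a hedged sketch, a single streaming read over the landing directory with an explicit schema picks up both the historical files already there and new files as they arrive (path, schema and options below are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("raw-logs").getOrCreate()

// Assumed schema for the JSON log records.
val logSchema = new StructType()
  .add("timestamp", StringType)
  .add("source", StringType)
  .add("payload", StringType)

// One streaming DataFrame over the landing directory: existing (historical)
// files and newly arriving files flow through the same query.
val rawLogs = spark.readStream
  .schema(logSchema)
  .option("maxFilesPerTrigger", "100") // throttle files per micro-batch
  .json("/data/landing/json")
```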

We shared a high level overview of the steps—extracting, transforming, loading and finally querying—to set up your streaming ETL production pipeline.

In future blog posts in this series, we’ll cover how we address other hurdles. If you want to learn more about Structured Streaming, here are a few useful links.

Why we need SQL for data stream processing and real-time streaming analytics

There are many reasons why: some technical (such as the power of the language), and some more business-oriented (such as the suitability of SQL platforms for reliable enterprise deployments, better performance, and significantly lower overall cost when compared with open source and proprietary SQL-like platforms).

   · SQL applications can be built in a fraction of the time required for low-level open source platforms and proprietary SQL-like platforms – a significant cost saving plus much faster time to implementation.  

   · SQL supports user defined operations that can be written in Java and deployed in a SQL query – this covers the small percentage of operations that cannot be readily expressed in SQL for whatever reason.  
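As an illustration of the user-defined-operation point, here is a sketch using Spark SQL purely as an example platform (and Scala rather than Java); the function and table are made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-example").getOrCreate()
import spark.implicits._

// A made-up user-defined operation, written in ordinary code.
spark.udf.register("is_large_payment", (amount: Double) => amount > 10000.0)

Seq(("p1", 250.0), ("p2", 50000.0))
  .toDF("payment_id", "amount")
  .createOrReplaceTempView("payments")

// The UDF is then used like any built-in function inside the SQL query.
spark.sql(
  "SELECT payment_id, amount FROM payments WHERE is_large_payment(amount)"
).show()
```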

I would argue, however, that these advantages are only true of SQL platforms, and not of Java or proprietary SQL-like platforms, where the latter use SQL-style queries built using Java (or another language) constructs, which rather defeats the purpose and negates the benefits.

SELECT … FROM … (stream) WHERE … is a simple example, but adding clauses such as MERGE, JOIN, UNION, OVER, ORDER BY, GROUP BY and PARTITION BY adds powerful real-time correlation and query ability on data streams, particularly when combined with the WINDOW operator for processing data streams over time windows (sliding WINDOWs, and with GROUP BY for tumbling windows).

For example, if we wanted to analyze a weather data stream continuously and find the minimum, maximum and average temperatures recorded each minute, we could use a continuous SQL aggregation query over a one-minute tumbling window.
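The author's query itself isn't included in this excerpt; as a rough analogue (expressed here in Spark SQL rather than the vendor's streaming SQL dialect, with an assumed "weather" view), a one-minute tumbling-window aggregation could look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("weather-sql").getOrCreate()

// Assumes a streaming DataFrame of weather readings has been registered as a
// view named "weather", with columns event_time (timestamp) and temperature.
val perMinuteStats = spark.sql("""
  SELECT window(event_time, '1 minute') AS minute,
         MIN(temperature) AS min_temp,
         MAX(temperature) AS max_temp,
         AVG(temperature) AS avg_temp
  FROM weather
  GROUP BY window(event_time, '1 minute')
""")
```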

However, if we wanted a continuous output stream, with the minimum, maximum and average values updated for each and every input data event over the preceding minute, we would execute a sliding window operation using a continuous SQL query.
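Again as a rough analogue rather than the author's query (reusing the session and "weather" view from the previous sketch), Spark expresses the sliding variant by adding a slide interval to the window; note this emits results per overlapping window rather than strictly per input event, so it only approximates the per-event semantics described above:

```scala
import org.apache.spark.sql.functions._

// Sliding one-minute window advancing every five seconds (intervals assumed).
val slidingStats = spark.table("weather")
  .groupBy(window(col("event_time"), "1 minute", "5 seconds"))
  .agg(
    min(col("temperature")).as("min_temp"),
    max(col("temperature")).as("max_temp"),
    avg(col("temperature")).as("avg_temp"))
```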

Those are some examples where the power of the SQL syntax and its expressiveness is obvious, but to balance this out, consider an operation that is frequently held up as an example where proprietary languages may offer better syntax – ‘followed by’ operations.

However, the execution performance of the query will improve significantly if the query planner and optimizer is allowed to do its work, for example, using a streaming pipeline with nth_value to detect a sequence of events over a one-minute time window.

You could argue that an alert_on(A then B then C) construct is more concise; however, SQL offers powerful and elegant constructs that are general, can be optimized dynamically for optimum performance given the current underlying hardware configuration, and can be combined with other SQL operations such as sliding windows and GROUP BY.

Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2

This is the third post in a multi-part series about how you can perform complex streaming analytics using Apache Spark.

Kafka is a distributed pub-sub messaging system that is popular for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner.

This renders Kafka suitable for building real-time streaming data pipelines that reliably move data between heterogeneous processing systems.

The infinite nature of this stream means that when starting a new query, we have to first decide what data to read and where in time we are going to begin.

At a high level, there are three choices: start from the earliest available offsets, start from the latest offsets, or start from specific per-partition offsets. As you will see below, the startingOffsets option accepts one of the three options above, and is only used when starting a query from a fresh checkpoint.

Instead of one option, we split these concerns into two different parameters, one that says what to do when the stream is first starting (startingOffsets), and another that handles what to do if the query is not able to pick up from where it left off, because the desired data has already been aged out (failOnDataLoss).

Spark allows you to read an individual topic, a specific set of topics, a regex pattern of topics, or even a specific set of partitions belonging to a set of topics.
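A hedged example of those source options (brokers, topic names and settings below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-source").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092") // placeholder brokers
  .option("subscribe", "nest-logs")        // or "subscribePattern" -> "nest-.*"
  .option("startingOffsets", "latest")     // only consulted on a fresh checkpoint
  .option("failOnDataLoss", "false")       // tolerate data aged out of Kafka
  .load()
```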

Writing data from any Spark supported data source into Kafka is as simple as calling writeStream on any DataFrame that contains a column named “value”, and optionally a column named “key”.

As in the above example, an additional topic option can be used to set a single topic to write to, and this option will override the “topic” column if it is present in the data.
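A minimal sketch of the sink side (continuing from the source sketch above; topic and paths are placeholders): the DataFrame must expose a string or binary “value” column, optionally a “key”, and the target topic can be supplied via the topic option.

```scala
// Echo records from the source DataFrame to another topic.
df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("topic", "nest-logs-copy")                // overrides any "topic" column
  .option("checkpointLocation", "/tmp/checkpoints/nest-copy")
  .start()
```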

We’ll specifically examine data from Nest’s cameras, which arrives as JSON records. We’ll also be joining with a static dataset (called “device_locations”) that contains a mapping from device_id to the zip_code where the device was registered.

Given a stream of updates from Nest cameras, we want to use Spark to perform several different tasks. While these may sound like wildly different use cases, you can perform all of them using DataFrames and Structured Streaming in a single end-to-end Spark application!

In the following sections, we’ll walk through individual steps, starting from ingest to processing to storing aggregated results.

Then, we apply various transformations to the data and project the columns related to camera data in order to simplify working with the data in the sections to follow.

Now, let’s generate a streaming aggregate that counts the number of camera person sightings in each zip code for each hour, and write it out to a compacted Kafka topic called “nest-camera-stats”.
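The post's exact query isn't reproduced in this digest; a hedged sketch of such an aggregation (the input DataFrame and column names like has_person, timestamp and zip_code are assumptions) might look like:

```scala
import org.apache.spark.sql.functions._

// Count person sightings per zip code per hour and publish the running counts,
// keyed by zip code and hour window, to the compacted topic.
val sightingsPerZipPerHour = parsedNestData          // parsed camera events (assumed)
  .filter(col("has_person") === true)
  .withWatermark("timestamp", "2 hours")
  .groupBy(col("zip_code"), window(col("timestamp"), "1 hour"))
  .count()

sightingsPerZipPerHour
  .selectExpr(
    "CONCAT(zip_code, ':', CAST(window.start AS STRING)) AS key",
    "CAST(`count` AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("topic", "nest-camera-stats")
  .option("checkpointLocation", "/tmp/checkpoints/nest-camera-stats")
  .outputMode("update")
  .start()
```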

The above query will process any sighting as it occurs and write out the updated count of the sighting to Kafka, keyed on the zip code and hour window of the sighting.

Over time, many updates to the same key will result in many records with that key, and Kafka topic compaction will delete older updates as new values arrive for the key.

We implemented an end-to-end example of a continuous application, demonstrating the conciseness and ease of programming with Structured Streaming APIs, while leveraging the powerful exactly-once semantics these APIs provide.

In future blog posts in this series, we’ll cover more of these topics. If you want to learn more about Structured Streaming, here are a few useful links. To try Structured Streaming in Apache Spark, try Databricks today.

For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.

Webinar: Building a real-time analytics pipeline with BigQuery and Cloud Dataflow (EMEA)

Join the live chat Q&A at: Real-time ingestion and analysis of data streams is ..

Stream Processing Pipeline - Using Pub/Sub, Dataflow & BigQuery

Video on how Google Cloud Platform components like Pub/Sub, Dataflow and BigQuery are used to handle streaming data.

Taking KSQL to Production | Level Up your KSQL by Confluent

Try KSQL: | This video describes how to use KSQL in streaming ETL pipelines, scale query processing, isolate workloads, and secure ..

Demo: Build a Streaming Application with KSQL

Try KSQL: | This demo application illustrates how to build a streaming application using KSQL. In the video, Tim will walk through the ..

Easy, Scalable, Fault-tolerant Stream Processing in Apache Spark | DataEngConf NYC '17

Don't miss the next DataEngConf in Barcelona: Download the slides for this talk: Structured Streaming, first .

Developer Preview: KSQL from Confluent

NOTE: There's a new version of this demo! Please watch it instead: More info: | KSQL is an open source .

AWS New York Summit 2016: Introduction to Amazon Kinesis Analytics - Process Streaming Data with SQL

Amazon Kinesis Analytics is the easiest way to process streaming data in real time with standard SQL without having to learn new programming languages or ...

Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL

Apache Kafka is a high-throughput distributed streaming platform that is being adopted by hundreds of companies to manage their real-time data. KSQL is an ...

Real-time big data processing with Spark Streaming- Tathagata Das (Databricks)

Live from Spark Summit 2013 // About the Presenter // Tathagata Das is an Apache Spark Committer and a member of the PMC. He's the lead developer behind ...

Azure Friday | Visually build pipelines for Azure Data Factory V2

Gaurav Malhotra shows Donovan Brown how you can now visually build pipelines for Azure Data Factory V2 and be more productive by getting pipelines up ...