AI News: How VW Predicts Churn with GPU-Accelerated Machine Learning and Visual Analytics

How VW Predicts Churn with GPU-Accelerated Machine Learning and Visual Analytics

The reason for this is that while each technology in the process leverages GPUs beautifully on its own, if data has to leave the GPU to move to the next system in the process, it can incur significant latency.

So, keeping the data in a GPU buffer through exploration, extraction, preprocessing, model training, validation, and prediction makes the whole process much faster and simpler.

MapD and Anaconda, another GOAI founding member, are involved in developing Pythonic clients such as pymapd (an interface to MapD's SQL engine supporting DB-API 2.0) and pygdf (a Python interface for accessing and manipulating the GPU Data Frame), along with our core platform modules: the MapD Core SQL engine and MapD Immerse, our visual analytics tool.

With the help of Apache Arrow, an efficient data interchange is created between MapD and pygdf, which makes it possible to leverage machine learning tools such as H2O.ai, PyTorch, and others.
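As a rough sketch of what that interchange looks like from Python (the connection parameters and the table/column names below are placeholders for illustration):

```python
import pymapd

# Placeholder connection details for a local MapD instance
con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", dbname="mapd")

# select_ipc_gpu runs the query on MapD Core and hands the result back as a
# GPU dataframe via Arrow, so the data never leaves GPU memory.
gdf = con.select_ipc_gpu("SELECT feature_1, feature_2 FROM churn_table")
```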

The example used in this post, the Customer Automotive Churn dataset, focuses on the real-world problem of customers churning out of their vehicles; it was obtained from Volkswagen as a result of our joint collaboration on implementing an analytics workflow on GPUs.

Assuming that you have loaded the churn dataset into MapD, let's start to build some charts in MapD Immerse, which by default starts at https://localhost:9092.

The capability used in this post to display charts from different tables in one dashboard is limited to the MapD Immerse Enterprise Edition, but you can use the Community Edition to create a separate dashboard for each chart.

It can be observed that car models produced in earlier years, especially models 8, 10, and 11, are more prone to churn.

Each of the two queries had 21 feature columns; combined, they returned 1.7 million data points, and extracting the data with both queries took just 0.45 seconds.

This is because, with the help of Arrow, pointers to the GPU memory buffers holding the data are passed from MapD to Python, which gives us back a pygdf dataframe.

So, it took roughly 1.3 seconds to train a model on approximately 1.7 million data points, 14 seconds to cross-validate, and just 0.3 seconds to copy the data into a pandas dataframe.
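That final copy off the GPU into host memory is a one-liner in pygdf:

```python
# Copy the GPU dataframe into host memory as a pandas dataframe
pdf = gdf.to_pandas()
```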

We can repeat this process many times, and the best part is that we get results within a few minutes, compared to the hours spent assessing and assembling data by traditional means.

NVIDIA Developer Blog

GOAI—also joined by BlazingDB, Graphistry and the Gunrock project from the University of California, Davis—aims to create open frameworks that allow developers and data scientists to build applications using standard data formats and APIs on GPUs.

Bringing standard analytics data formats to GPUs will allow data analytics to be even more efficient, and to take advantage of the high throughput of GPUs.

The GPU Technology Conference, held two weeks ago in San Jose, CA, showcased numerous breakthroughs in deep learning, self-driving cars, virtual reality, accelerated computing, and more.

Once data is loaded into the GPU data frame, any application that uses this common API can access and modify the data without leaving the GPU (Figure 2).

This allows users to interact with data using SQL or Python, and then run machine learning algorithms on it, all without transferring the data back to the CPU.

The number of unique values in each column is calculated on the GPU by the GPU data frame function unique_k(), and each column is classified as categorical if it has fewer than 1000 unique values, and as numerical otherwise.
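A sketch of that classification rule, assuming a pygdf-style dataframe whose columns expose unique_k() (the iteration and indexing around it are assumptions for illustration):

```python
THRESHOLD = 1000

def split_columns(gdf):
    """Split dataframe columns into categorical and numerical lists."""
    categorical, numerical = [], []
    for name in gdf.columns:
        # unique_k(k) computes up to k unique values on the GPU
        if len(gdf[name].unique_k(k=THRESHOLD)) < THRESHOLD:
            categorical.append(name)
        else:
            numerical.append(name)
    return categorical, numerical
```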

Now that we have numerical and categorical variables, we should center each column to its mean and scale it componentwise to unit variance, a transformation commonly called standardizing.
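As a minimal illustration of the transformation (plain NumPy here, not the GPU data frame API):

```python
import numpy as np

def standardize(x):
    """Center each column to its mean and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_std = standardize(X)  # each column now has mean 0 and variance 1
```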

Next, we tell H2OAIGLM to remove the mean income (intercept) in order to fit the residual income for more accuracy, and that the data is already standardized so the GLM does not need to do it again.
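In scikit-learn terms, a CPU analogue of those two hints (explicitly a stand-in, not the H2OAIGLM API) would be to disable intercept fitting, on data that has already been standardized:

```python
from sklearn.linear_model import ElasticNet

# fit_intercept=False tells the solver the target is already centered
# (the mean income has been removed), so no intercept term is needed;
# scikit-learn's ElasticNet applies no internal rescaling, matching
# inputs that have already been standardized.
net = ElasticNet(fit_intercept=False)
```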

In order to avoid fitting to noise in the data, the elastic net GLM (net) fits using regularization in two forms. The first form, L1 lasso regularization, encourages sparse fits by driving the coefficients of uninformative features toward zero.

The second form, L2 ridge regularization, tries to suppress complicated fits in favor of simple generalizable fits.

These parameters each vary over their entire span of possible values: alpha from 0 to 1, and lambda along a path from the maximum possible lambda value down to lambda_min_ratio times that maximum.
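For reference, the standard textbook form of the elastic net objective (the exact parameterization inside H2OAIGLM may differ) is:

$$ \min_{\beta}\ \frac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda\left(\alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2\right) $$

Here alpha blends the L1 (lasso) and L2 (ridge) penalties, and lambda sets the overall strength of the regularization.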

The plot in Figure 4 shows how data points (red) can be fit with a complicated (blue line) or simple (green line) solution.

The blue line represents a model that overfits the data and therefore does not generalize well, because it relies too heavily on very specific data values that could be contaminated by noise, vary over time, or differ in each instance for other reasons.

In order to further regularize the fit to ensure it can generalize well, we also perform cross validation, which removes a portion of training data and uses that portion to test how well the model fits.

Given 5 folds, 8 alphas, and 100 lambdas, this method will run 4000 models (5 × 8 × 100) using the maximum number of GPUs on a system (nGPUs = maxNGPUS).
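To make the grid concrete, here is a CPU analogue of the same sweep using scikit-learn's ElasticNetCV, explicitly a stand-in for the GPU H2OAIGLM solver (note that scikit-learn's l1_ratio plays the role of alpha here, and its internal alpha path plays the role of the lambdas):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

X = np.random.rand(1000, 21)   # toy feature matrix
y = np.random.rand(1000)       # toy target

net = ElasticNetCV(
    l1_ratio=list(np.linspace(0.01, 1.0, 8)),  # 8 alphas (L1/L2 mix)
    n_alphas=100,                              # 100 lambdas along the path
    cv=5,                                      # 5 cross-validation folds
)
net.fit(X, y)                  # 5 * 8 * 100 = 4000 model fits in total
```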

The validation data was not used to make the model, so can be trusted as an impartial data set to test if the model generalizes well and remains accurate.

Red corresponds to no better than using the mean income to predict any income, while green indicates a more accurate model that does much better than just using the mean income to predict the income.

The models in green use various features in the data, combined in a linear way, to produce a model that best matches the training data while being expected to generalize well to unseen test data.

On the other hand, by the end of the video, after about 215 seconds have elapsed, the dual Xeon system has only trained and evaluated 209 models.

Currently, the GPU data frame is a single-GPU data structure with support for multi-GPU model-parallel training, where each GPU gets a duplicate of the data and trains an independent model.

GOAI plans to add support for multi-GPU distributed data frames and data-parallel training (where GPUs work together to train a model) in the future.

In the future, H2O plans to provide additional machine learning models, such as gradient boosting machines (GBM), support vector machines (SVM), k-means clustering, and more, with multi-GPU data-parallel and model-parallel training support.

Integrating scale-out data warehousing, graph visualization, and graph analytics will give data scientists more tools to analyze even larger and more complex datasets.

The GPU Data Frame, along with its Python API, is the first project of the GPU Open Analytics Initiative aimed at creating common data frameworks that enable application developers and end users to accelerate data science on GPUs. We encourage you to try it out today and join the discussions on the GOAI Google Group and GitHub.

End-to-End Machine Learning with GOAI

In May 2017, MapD, along with H2O.ai and Continuum Analytics, announced the GPU Open Analytics Initiative (GOAI), with the goal of accelerating end-to-end analytics and machine learning on GPUs.

By adopting an existing standard, it is easier to tie in the GDF with the ecosystem and infrastructure already built around Arrow, allowing MapD and others in GOAI to leverage existing features like Parquet-to-Arrow conversion.
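For example, reading a Parquet file into Arrow takes only a couple of lines with pyarrow (the file path is a placeholder); because the GDF adopts Arrow's memory layout, data loaded this way can flow into GPU analytics without a custom converter:

```python
import pyarrow.parquet as pq

# Convert Parquet to an Arrow table in memory
table = pq.read_table("data.parquet")
```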

While GPUs can provide 100x more processing cores and 20x greater memory bandwidth than CPUs, systems and platforms are unable to harness these disruptive performance gains because they remain isolated from each other.

The initial GOAI prototype integrated the GDF to allow seamless passing of data among processes running on the same GPUs, and was shown to provide significant speedups.

The net result enables lightning-fast interactive data exploration and analysis, feature selection, model training, and model validation by virtue of avoiding any serialization overhead when moving data between processes.

We were able to transform the process of machine learning into an interactive experience by outputting MapD query results into a GPU Data Frame (GDF) and piping them directly to Anaconda and H2O.ai for further processing.

While the machine learning algorithms themselves typically get most of the attention, the data science workflow around exploring data, feature engineering, and iterative model training usually takes most of a data scientist’s time.

As you can see, building an accurate predictive model is a highly iterative process that benefits from being able to visually explore the data at interactive speeds.

GDFs break down the silos between systems and software to enable interactive data exploration, feature engineering, model training and model scoring.

MapD, H2O.ai, NVIDIA to Unveil GPU Data Frame at Strata

And, as Ricky Ricardo used to say, we’ve “got some splainin’ to do.” The GDF speeds up data science workflows by allowing them to be carried out entirely on GPUs.

One of the most tedious and time-consuming parts of building a machine learning model is feature engineering — “the process of using domain knowledge of the data to create features that make machine learning algorithms work.” The data scientists we meet aren’t engineering four features on 1,000-row datasets.

Without the analytic acceleration made possible by MapD, that feature engineering sucks up hours or days of a data scientist’s limited time.

Inevitably, the model’s first training iteration returns results that could be improved, and then it’s back to the feature engineering and another training attempt.

We invite all technologists and data scientists interested in accelerating data science on GPUs to join us and contribute to the open-source PyGDF repository on GitHub.

GPU-Accelerated Big Data Analytics

MapD Immerse is a web-based data visualization interface, leveraging the GPU-accelerated speed and rendering capabilities of MapD Core and MapD Render for unparalleled visual interaction.

MapD Immerse is incredibly intuitive and easy to use, providing standard visualizations, such as line, bar, and pie charts, as well as complex data visualizations, such as geo-point maps, geo heat maps, choropleths, and scatter plots.

O'Reilly AI NYC 2017: Learn how a GPU database helps you deploy an easy-to-use scalable AI solution

Artificial intelligence's promise is to change how we work and live. With cognitive applications in healthcare, retail, financial services, manufacturing, and ...

IBM Scientists Develop Algorithm to Accelerate Machine Learning Training

IBM and EPFL scientists have developed a scheme for workloads where big data sets need to be trained quickly. In their paper they report on how their scheme ...

Custom Application for Urban Planning

Pactriglo is a Los Angeles-based real estate intelligence platform that provides competitive insights to developers who build apartments. The company ...

Scalable Machine Learning in R and Python with H2O

This is a recording of the first East Bay AI and Deep Learning meetup hosted at WeWork Berkeley on May 3, 2017. Please excuse the audio as this session was ...

Better Together: Fast Data with Apache Spark and Apache Ignite

Apache Spark and Apache Ignite are two powerful solutions for high-performance Big Data and ...