
Data Center Scale Computing and Artificial Intelligence with Matei Zaharia, Inventor of Apache Spark

At Microsoft, we are privileged to work with individuals whose ideas are blazing a trail, transforming entire businesses through the power of the cloud, big data and artificial intelligence.

And what we saw across all of these is that this type of large data-center-scale computing was very powerful, and there were a lot of interesting applications you could build with it, but the MapReduce programming model alone wasn’t really sufficient. That was especially true for machine learning – something everyone wanted to do but which wasn’t a good fit – but also for interactive queries, streaming and other workloads.

So in Spark we stepped back, looked at these workloads, and asked whether there was a common abstraction that could handle all of them. We ended up with something that was a pretty small change to MapReduce – MapReduce plus fast data sharing, which is the in-memory RDDs in Spark – and just hooking these up into a graph of computations turned out to be enough to get really good performance across all the workloads, matching the specialized engines, and much better performance when a workload combines several steps.
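That “MapReduce plus fast data sharing” idea is easy to see in a few lines of PySpark. Here is a minimal sketch (the log path and tab-separated format are hypothetical): one cached RDD feeds two separate computations without being re-read from disk for each.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sharing-sketch")

# Hypothetical tab-separated log file; the point is the reuse, not the data
events = sc.textFile("hdfs:///logs/events.txt").map(lambda l: l.split("\t"))

# cache() keeps the parsed RDD in memory, so the two jobs below share
# the data instead of each re-reading and re-parsing it from disk
events.cache()

errors = events.filter(lambda f: f[0] == "ERROR").count()
warnings = events.filter(lambda f: f[0] == "WARN").count()
```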

One reason is that much of machine learning is preparing and understanding the data – both the input data and the model’s predictions and behavior – and Spark really excels at that kind of ad hoc data processing using code: you can use SQL, you can use Python, you can use DataFrames, and it makes those operations easy. Everything you do also scales to large datasets, which matters because you want to train machine learning on lots of data.

Beyond that, it supports iterative in-memory computation, so many algorithms run well inside it, and because of its support for composition and its pluggable-library API, there are also quite a few libraries you can plug in that call out to external compute engines optimized for different types of numerical computation.

So I think Ray has been focused on reinforcement learning, where one of the main things you have to do is spawn a lot of small independent tasks. That’s a bit different from a big data framework like Spark, where you’re doing one computation on lots of data – these are separate computations that take different amounts of time – and, as far as I know, users are starting to adopt it and it’s getting good traction.

I think the thing I’m most interested in, both for Databricks products and for Apache Spark, is enabling it to be a platform where you can combine the best algorithms, libraries and frameworks, because that’s what seems most valuable to end users: they can orchestrate a workflow and program it as easily as writing a single-machine application where they just import a bunch of libraries.

When you’re doing machine learning and AI projects, it’s really important to be able to iterate quickly, because it’s all about experimenting, finding whether something will work, and failing fast if a particular idea doesn’t.

And the reason that’s important is that for web-scale problems you have a lot of labeled data – for something like web search you can solve the problem that way – but for many scientific or business problems you don’t. So how can you learn from a large dataset that’s not quite in your domain, like the web, and then apply that to something like medical images, where only a few hundred patients have a certain condition and you can’t get millions of images?

But yeah, there’s everything from new hardware for machine learning, where you throw away the constraint that the computation has to be precise and deterministic, to new applications, to things like security of AI, adversarial examples and verifiability – I think these are all pretty interesting areas.

In the business space, probably some of the more exciting things involve image data, where, using deep learning and transfer learning, you can start to reliably build classifiers for different types of domain-specific data.

MZ: Databricks, for people not familiar, offers basically a Unified Analytics Platform, where you can work with big data, mostly through Apache Spark, and collaborate on it within an organization – different people can develop, say, notebooks to perform computations, others can develop production jobs, and you can connect these together into workflows, and so on.

And then another product that we featured a lot at our Spark Summit conference this year is Databricks Delta, which is basically a transactional data management layer on top of cloud object stores. It lets us do things like indexing and reliable exactly-once stream processing at very large scale, and that addresses a problem all our users have, because they all need to set up a reliable data ingest pipeline.
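As a rough illustration of what that ingest pattern looks like – a sketch using today’s open-source Delta Lake API rather than the managed Databricks product, with a made-up schema and made-up paths – a streaming read of raw JSON files is written transactionally to a Delta table, and the checkpoint plus Delta’s atomic commits give exactly-once processing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Requires the Delta Lake package on the classpath,
# e.g. spark-submit --packages io.delta:delta-core_2.12:2.4.0
spark = SparkSession.builder.appName("delta-ingest-sketch").getOrCreate()

# Hypothetical schema for raw network events
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("source_ip", StringType()),
    StructField("action", StringType()),
])

# Stream the raw files into a Delta table; the checkpoint plus Delta's
# transactional commits make the pipeline exactly-once across restarts
(spark.readStream.schema(schema).json("/raw/network-events")
      .writeStream.format("delta")
      .option("checkpointLocation", "/checkpoints/network-events")
      .start("/delta/network-events"))
```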

So, Apple’s internal information security group – the group that does network monitoring – gets hundreds of terabytes of network events per day to process in order to detect intrusions and information security problems.

They spoke about using Databricks Delta and streaming with Apache Spark to handle all of that. It’s one of the largest applications people have talked about publicly, and it’s very cool because the whole thing is an arms race between the security team and attackers, so you really want to be able to design new rules and new measurements and add new data sources quickly.

We also have some really exciting health and life sciences applications; some of these teams are starting to discover new drugs that companies can productionize to tackle new diseases, all based on large-scale genomics and statistical studies.

Learning Spark: Lightning-Fast Big Data Analysis (1st Edition), by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau


Data science, a discipline that has been emerging over the past few years, centers on analyzing data.

Oftentimes, data scientists’ workflow involves ad hoc analysis, so they use interactive shells (rather than building complex applications) that let them see the results of queries and snippets of code in the least amount of time. Spark’s interactive shells are a natural fit for this style of exploratory work.

Sometimes, after the initial exploration phase, the work of a data scientist will be “productized,” or extended, hardened (i.e., made fault-tolerant), and tuned to become a production data processing application, which itself is a component of a business application.

For example, the initial investigation of a data scientist might lead to the creation of a production recommender system that is integrated into a web application and used to generate product suggestions to users.

Apache Spark Tutorial: ML with PySpark

A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

These spatial data contain 20,640 observations on housing prices with 9 economic variables. What’s more, you also learn that all block groups with zero entries for the independent and dependent variables have been excluded from the data.

You already gathered a lot of information just by looking at the web page where you found the data set, but it’s always better to get hands-on and inspect the data yourself – in this case with Spark and Python.

You have to push Spark to do the work for you, so let’s use the collect() method to look at the header. collect() brings the entire RDD over to a single machine (the driver), so be careful when you use it: it is only safe on small datasets like this header file.
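A minimal sketch of this step, assuming the data and its description file were downloaded locally (the paths here are hypothetical; the file names follow the California Housing dataset):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cal-housing")

# Load the observations and the accompanying header/description file
rdd = sc.textFile("data/CaliforniaHousing/cal_housing.data")
header = sc.textFile("data/CaliforniaHousing/cal_housing.domain")

# collect() pulls the whole RDD to the driver -- fine for the tiny
# header file, dangerous for the 20,640-row data set itself
header.collect()
```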

You learn that the order of the variables is the same as the one that you saw above in the presentation of the data set, and you also learn that all columns should have continuous values.

Alternatively, you can use a few other functions to inspect your data, as sketched below. If you’re used to working with Pandas or data frames in R, you’ll probably also have expected to see a header, but there is none.
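For instance, continuing with the rdd defined above:

```python
# take(n) only fetches the first n elements to the driver,
# which is much safer than collect() on a large RDD
rdd.take(2)

# Other lightweight ways to inspect an RDD
rdd.first()   # just the first element
rdd.count()   # number of records (triggers a full pass over the data)
```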

To recap, you’ll now switch to DataFrames in order to use high-level expressions, perform SQL queries to explore the data further, and gain columnar access.

To make this more visual, consider the first line of the sketch below: the lambda function says that you’re going to construct a row in a SchemaRDD, and that the element at index 0 will get the name “longitude”, and so on.
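Here is a sketch of that conversion, with column names taken from the dataset description above (the split on "," assumes the comma-separated layout of cal_housing.data):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Split each comma-separated line into its fields
split_rdd = rdd.map(lambda line: line.split(","))

# Name each positional field, turning the RDD of lists into a DataFrame:
# index 0 becomes "longitude", index 1 "latitude", and so on
df = split_rdd.map(lambda line: Row(longitude=line[0],
                                    latitude=line[1],
                                    housingMedianAge=line[2],
                                    totalRooms=line[3],
                                    totalBedrooms=line[4],
                                    population=line[5],
                                    households=line[6],
                                    medianIncome=line[7],
                                    medianHouseValue=line[8])).toDF()
```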

Now that you have your DataFrame df, you can inspect it with the methods you used before, namely first() and take(), but also with head() and show(). You’ll immediately see that the output looks much different from the RDD you were working with before. Tip: use df.columns to return the columns of your DataFrame.
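For example:

```python
df.first()     # the first Row
df.take(3)     # the first three Rows
df.head(5)     # like take(), returns the first five Rows
df.show(10)    # pretty-printed tabular output of the first 10 rows
df.columns     # the list of column names
```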

Intuitively, you might go for a solution like the one sketched below, where you declare that each column of the DataFrame df should be cast to a FloatType(). But these repeated calls are quite obscure and error-prone, and they don’t really look nice; a small helper that loops over the column names is tidier.
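A sketch of both versions – the helper name convert_columns is illustrative, not part of any library:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# The naive approach: one withColumn call per column
df = df.withColumn("longitude", col("longitude").cast(FloatType())) \
       .withColumn("latitude", col("latitude").cast(FloatType()))
# ...and so on for the remaining seven columns.

# Tidier: loop over the column names with a small helper
def convert_columns(df, names, new_type):
    for name in names:
        df = df.withColumn(name, col(name).cast(new_type))
    return df

df = convert_columns(df, df.columns, FloatType())
```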

Let’s start small and select just two columns from df, of which you only want to see 10 rows. You can also make your queries more complex, as in the second example below. Besides querying, you can also describe your data to get some summary statistics.
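A sketch of those three steps:

```python
# Select two columns and show only the first 10 rows
df.select("population", "totalBedrooms").show(10)

# A more complex query: group by one column, count, and sort
df.groupBy("housingMedianAge").count() \
  .sort("housingMedianAge", ascending=False).show()

# Summary statistics (count, mean, stddev, min, max) per column
df.describe().show()
```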

Armughan Ahmad, Dell EMC | Super Computing 2017

Armughan Ahmad, SVP & GM, Hybrid Cloud & Ready Solutions, Dell EMC, talks with Jeff Frick at Super Computing 2017.