Artificial intelligence meets the C-suite
The exact moment when computers got better than people at human tasks arrived in 2011, according to data scientist Jeremy Howard, at an otherwise inconsequential machine-learning competition in Germany.
As machine learning progresses at a rapid pace, top executives will be called on to create the innovative new organizational forms needed to crowdsource the far-flung human talent that’s coming online around the globe.
To sort out the exponential advance of deep-learning algorithms and what it means for managerial science, McKinsey’s Rik Kirkland conducted a series of interviews in January at the World Economic Forum’s annual meeting in Davos.
Norton, January 2014)—and two leading entrepreneurs: Anthony Goldbloom, the founder and CEO of Kaggle (the San Francisco start-up that’s crowdsourcing predictive-analysis contests to help companies and researchers gain insights from big data);
The second big deal is the global interconnection of the world’s population, billions of people who are not only becoming consumers but also joining the global pool of innovative talent.
In 2012, a team of four expert pathologists looked through thousands of breast-cancer screening images, and identified the areas of what’s called mitosis, the areas which were the most active parts of a tumor.
The algorithm came back with something that agreed with the pathologists 60 percent of the time, so it is more accurate at identifying the very thing that these pathologists were trained for years to do.
Andrew McAfee: We thought we knew, after a few decades of experience with computers and information technology, the comparative advantages of human and digital labor.
A digital brain can now drive a car down a street and not hit anything or hurt anyone—that’s a high-stakes exercise in pattern matching involving lots of different kinds of data and a constantly changing environment.
The data and the computational capability are increasing exponentially, and the more data you give these deep-learning networks and the more computational capability you give them, the better the result becomes because the results of previous machine-learning exercises can be fed back into the algorithms.
He was able to do that with no particularly special skills and no company infrastructure, because he was building it on top of an existing platform, Facebook, which of course is built on the web, which is built on the Internet.
Jeremy Howard: I think people are massively underestimating the impact, on both their organizations and on society, of the combination of data plus modern analytical techniques.
Then Google fed that to a machine-learning algorithm and said, “You figure out what’s unique about those circled things, find them in the other 100 million images, and then read the numbers that you find.”
So when you switch from a traditional to a machine-learning way of doing things, you increase productivity and scalability by so many orders of magnitude that the nature of the challenges your organization faces totally changes.
I can’t think of a corner of the business world (or a discipline within it) that is immune to the astonishing technological progress we’re seeing.
But if the people currently running large enterprises think there’s nothing about the technology revolution that’s going to affect them, I think they would be naïve.
And it’s very painful—especially for experienced, successful people—to walk away quickly from the idea that there’s something inherently magical or unsurpassable about our particular intuition.
Data will tell you what’s really going on, whereas domain expertise will always bias you toward the status quo, and that makes it very hard to keep up with these disruptions.
What he was stressing was the importance of being able to ask the right questions, and that skill is going to be very important going forward and will require not just technical skills but also some domain knowledge of what your customers are demanding, even if they don’t know it.
Once you get to that point, the best thing you can possibly do is to get rid of the domain expert who comes with preconceptions about what are the interesting correlations or relationships in the data and to bring in somebody who’s really good at drawing signals out of data.
They also have seismic data, where they shoot sound waves down into the rock and, based on the time it takes for those sound waves to be captured by a recorder, they can get a sense for what’s under the earth.
And when you manually interpret what comes off a sensor on a drill bit or a seismic survey, you miss a lot of the richness that a machine-learning algorithm can pick up.
But the pilot programs in big enterprises seem to be very precisely engineered never to fail—and to demonstrate the brilliance of the person who had the idea in the first place.
It has a unit called the human performance analytics group, which takes data about the performance of all of its employees and what interview questions were they asked, where was their office, how was that part of the organization’s structure, and so forth.
Anthony Goldbloom: One huge limitation that we see with traditional Fortune 500 companies—and maybe this seems like a facile example, but I think it’s more profound than it seems at first glance—is that they have very rigid pay scales.
The more rigid pay scales at traditional companies don’t allow them to do that, and that’s irrational because the return on investment on a $5 million, incredibly capable data scientist is huge.
Machine learning and computers aren’t terribly good at creative thinking, so the idea that the rewards of most jobs and people will be based on their ability to think creatively is probably right.
On The Subject of Thinking Machines
68 years ago, Alan Turing posed the question “Can machines think?” in his seminal paper “Computing Machinery and Intelligence”, and he formulated the “Imitation Game”, also known as the Turing test, as a way to answer this question without referring to a rather ambiguous dictionary definition of the word “think”. We have come a long way toward building intelligent machines; in fact, the rate of progress in Deep Learning and Reinforcement Learning, the two cornerstones of artificial intelligence, is unprecedented.
We define thinking as “the process by which we evaluate features learned from past experiences in order to make decisions about new problems”. In the context of human thinking, when you see a person and are faced with the task of determining who that person is (The New Problem), an activity (The Process) begins in your brain that goes through the search space of all the people whose faces you can remember (The Experience). You then begin to consider the nose, eyes, skin color, dressing, height, speech and any other observable traits (The Features). The process then attempts to match these features to a particular person based on people you have seen before; if no satisfactory match is found, the brain concludes that this is a stranger (The Decision).
Consider, on the other hand, a computer vision system trying to perform the same task using Convolutional Neural Networks. When the image of a person is input, the 3-dimensional tensor of pixels is observed (The Features). The network then searches (The Process) for the presence of previously learned features called kernels or filters (The Experience); it then computes which of these features are present in the new image and returns a set of class scores (The Decision).
The two are remarkably similar, differing mainly in the mechanism by which the task is carried out. For example, Convolutional Neural Networks do not take the position of features into consideration: a nose at the position of the ear makes no difference to a CNN, whereas the process in humans does take this into account. However, Capsule Networks, recently proposed by Geoffrey Hinton et al.
Artificial neural networks were inspired by neuroscience, but their mechanisms are fundamentally different. We have largely given up on searching for systems that work like the brain; rather, we keep searching for systems that work well, without minding how far we deviate from the way the brain functions.
The ability of an agent to discover new policies, or new sub-policies in a hierarchical setting, gives such an agent the ability to formulate new ideas and even do new things that we never expected it to do in a particular environment.
Such discovery is limited by the finite set of actions available to the agent at a given time step or in a given state. However, humans are subject to very similar limitations; for example, you cannot just imagine that you want to fly without some jet pack or related equipment.
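To make the idea of "searching for the presence of a learned feature" concrete, here is a minimal sketch of sliding one kernel over a tiny image. The image, the kernel values, and the function name are illustrative assumptions; a real Convolutional Neural Network learns many such kernels and stacks further layers on top before producing class scores.

```python
def cross_correlate(image, kernel):
    """Slide a small kernel over a 2D image and return the response map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Hypothetical 4x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

response = cross_correlate(image, vertical_edge)
# The response is largest in the column where the edge sits.
strongest = max(max(row) for row in response)
```

The response map is non-zero only where the kernel's pattern appears, which is the sense in which a filter "detects" a feature.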
No mechanism could feel (and not merely artificially signal, an easy contrivance) pleasure at its successes, grief when its valves fuse, be warmed by flattery, be made miserable by its mistakes, be charmed by sex, be angry or depressed when it cannot get what it wants.” The above statement was made by Professor Jefferson in 1949.
The central core of the above assertion is the part, “No mechanism could feel (and not merely artificially signal, an easy contrivance)”. The idea that a reaction based on artificial signals cannot be described as a feeling is rather contrary to how the human system functions.
Take a humanoid robot. We can build sensors into its body such that, when it is touched, signals are sent to the processing system of the robot, which in this case acts as a brain. The cumulative signals over a period of time form a sequence that can then be fed into a type of supervised deep learning system called a Recurrent Neural Network.
In such a scenario, robot A might learn that some actions of an adversarial agent G are causing it to lose power faster than normal, possibly by making it overwork. Robot A may then decide not to obey commands given to it by Agent G, since data from past experience indicate a negative effect from commands initiated by Agent G.
The subconscious part of humans remains largely mysterious, and many actions are initiated by it. Hence, we can at least have a sense of superiority over robots, based on the fact that there is at present no proof that we can infuse machines with subconsciousness within the framework of Deep Learning and Reinforcement Learning.
Lecture 11: Introduction to Machine Learning
The following content is provided under a Creative Commons license.
Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free.
To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
In some cases, we said we could use validation to actually let us explore to find the best model that would fit it, whether a linear, a quadratic, a cubic, some higher order thing.
That's a nice segue into the topic for the next three lectures, the last big topic of the class, which is machine learning.
I've listed just five subjects in course six that all focus on machine learning.
So natural language processing, computational biology, computer vision, robotics all rely today heavily on machine learning.
The idea of having examples, and how do you talk about features representing those examples, how do you measure distances between them, and use the notion of distance to try and group similar things together as a way of doing machine learning.
I know labels on my examples, and I'm going to use that to try and define classes that I can learn, and clustering, working well when I don't have labeled data.
Unless Professor Guttag changes his mind, we're probably not going to show you the current really sophisticated machine learning methods like convolutional neural nets or deep learning, things you'll read about in the news.
And I'm going to admit with my gray hair, I started working in AI in 1975 when machine learning was a pretty simple thing to do.
AlphaGo, machine learning based system from Google that beat a world-class level Go player.
Any recommendation system, Netflix, Amazon, pick your favorite, uses a machine learning algorithm to suggest things for you.
The ads that pop up on Google are coming from a machine learning algorithm that's looking at your preferences.
Drug discovery, character recognition-- the post office does character recognition of handwritten characters using a machine learning algorithm and a computer vision system behind it.
Another great MIT company called Mobileye that does computer vision systems with a heavy machine learning component that is used in assistive driving and will be used in completely autonomous driving.
It will do things like kick in your brakes if you're closing too fast on the car in front of you, which is going to be really bad for me because I drive like a Bostonian.
So if you think back to the first lecture in 6.0001, we showed you Newton's method for computing square roots.
And you could argue, you'd have to stretch it, but you could argue that that method learns something about how to compute square roots.
We gave you a set of data points, mass displacement data points.
1959 is the quote in which he says, his definition of machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
It beat national level players, most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it thought about as it did that.
You may see in a follow on course, he invented what's called Alpha-Beta Pruning, which is a really useful technique for doing search.
And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm.
Normal programming, I know you're not convinced there's such a thing as normal programming, but if you think of traditional programming, what's the process?
I write a program that I input to the computer so that it can then take data and produce some appropriate output.
I wrote code for using Newton's method to find a square root, and then it gave me the process of, given any number, I'll give you the square root.
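As a reminder of what that traditional program looks like, here is a minimal sketch of Newton's method for square roots. The function name, the relative tolerance, and the iteration cap are assumptions made for this illustration, not the code shown in the earlier course.

```python
def newton_sqrt(x, epsilon=1e-9, max_iterations=100):
    """Approximate the square root of a non-negative number x."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x / 2.0 if x > 1 else 1.0
    for _ in range(max_iterations):
        # Stop once guess**2 is close enough to x (relative tolerance).
        if abs(guess * guess - x) <= epsilon * x:
            break
        # Newton update for f(g) = g**2 - x.
        guess = guess - (guess * guess - x) / (2 * guess)
    return guess


root_of_25 = newton_sqrt(25)
root_of_2 = newton_sqrt(2)
```

Here the programmer supplies the procedure; the computer only applies it to whatever input it is given.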
And in fact, in a machine learning approach, the idea is that I'm going to give the computer output.
I'm going to give it examples of what I want the program to do, labels on data, characterizations of different classes of things.
And what I want the computer to do is, given that characterization of output and data, I wanted that machine learning algorithm to actually produce for me a program, a program that I can then use to infer new information about things.
And that creates, if you like, a really nice loop where I can have the machine learning algorithm learn the program which I can then use to solve some other problem.
It learned a model for the data, which I could then use to label any other instances of the data or predict what I would see in terms of spring displacement as I changed the masses.
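A minimal sketch of that idea, assuming a straight-line model and made-up mass and displacement values (the numbers and function names below are hypothetical, not the data used in the course): the fitting step produces a small "program", the slope and intercept, which can then predict displacements for masses that were never measured.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the learned (slope, intercept) pair on a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations: masses in kg, spring displacements in m.
masses = [0.1, 0.2, 0.3, 0.4, 0.5]
displacements = [0.03, 0.05, 0.07, 0.09, 0.11]

model = fit_line(masses, displacements)
predicted = predict(model, 0.6)  # a mass that was not in the training data
```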
Memorize as many facts as you can and hope that we ask you on the final exam instances of those facts, as opposed to some other facts you haven't memorized.
This is, if you think way back to the first lecture, an example of declarative knowledge, statements of truth.
Better way to learn is to be able to infer, to deduce new information from old.
And if you think about this, this gets closer to what we called imperative knowledge-- ways to deduce new things.
We're interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data.
So not something explicitly built like that comparison of weights and displacements, but actually implicit patterns in the data, and have the algorithm figure out what those patterns are, and use those to generate a program you can use to infer new data about objects, about spring displacements, whatever it is you're trying to do.
So the idea then, the basic paradigm that we're going to see, is we're going to give the system some training data, some observations.
We're going to then try and have a way to figure out, how do we write code, how do we write a program, a system that will infer something about the process that generated the data?
I gave you a set of data, spatial deviations relative to mass displacements.
So it's got all of those elements, training data, an inference engine, and then the ability to use that to make new predictions.
So the more common one is one I'm going to use as an example, which is, when I give you a set of examples, those examples have some data associated with them, some features and some labels.
So I'm going to show you in a second, I'm going to give you a set of examples of football players.
But what we want to do is then see how would we come up with a way of characterizing the implicit pattern of how does weight and height predict the kind of position this player could play.
What we're going to see, and we're going to see multiple examples today, is that that learning can be done in one of two very broad ways.
And in that case, for every new example I give you as part of the training data, I have a label on it.
And what I'm going to do is look for how do I find a rule that would predict the label associated with unseen input based on those examples.
I'm going to just try and find what are the natural ways to group those examples together into different models.
So what I want to do is decide what makes two players similar with the goal of seeing, can I separate this distribution into two or more natural groups.
It says how do I take two examples with values or features associated, and we're going to decide how far apart are they?
And in the unlabeled case, the simple way to do it is to say, if I know that there are at least k groups there-- in this case, I'm going to tell you there are two different groups there-- how could I decide how best to cluster things together so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and they're reasonably far apart.
What I'm going to try and do is create clusters with the property that the distances between all of the examples of that cluster are small.
And see if I can find clusters that get the average distance for both clusters as small as possible.
This algorithm works by picking two examples, clustering all the other examples by simply saying put it in the group to which it's closest to that example.
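The assignment step described here can be sketched as follows. The player weights and the choice of exemplars are hypothetical, and this shows only the "put each example with its closest exemplar" step, not a full clustering algorithm that goes on to refine the exemplars.

```python
def cluster_by_exemplars(examples, exemplars, distance):
    """Assign each example to the group of the exemplar it is closest to."""
    clusters = [[] for _ in exemplars]
    for example in examples:
        closest = min(
            range(len(exemplars)),
            key=lambda i: distance(example, exemplars[i]),
        )
        clusters[closest].append(example)
    return clusters


def absolute_difference(a, b):
    """Distance along a single axis."""
    return abs(a - b)


# Hypothetical player weights in pounds (one feature only).
weights = [185, 190, 200, 205, 300, 310, 320]

# Two examples picked as the starting exemplars.
clusters = cluster_by_exemplars(weights, [190, 310], absolute_difference)
```

With these made-up numbers the lighter and heavier players fall into separate groups, because each weight is much closer to one exemplar than to the other.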
This is what my algorithm came up with as the best dividing line here, meaning that these four, again, just based on this axis are close together.
And so any new example, if it's above the line, I would say gets that label, if it's below the line, gets that label.
In a second, we'll come back to look at how do we measure the distances, but the idea here is pretty simple.
But those are tight ends, those are wide receivers, and it's going to come back in a second, but there are the labels.
Basic idea, in this case, is if I've got labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space.
It says, in the two-dimensional case, I want to know what's the best line, if I can find a single line, that separates all the examples with one label from all the examples of the second label.
And as you already figured out, in this case, with the labeled data, there's the best fitting line right there.
So there are my linemen, the red ones are my receivers, the two black dots are the two running backs.
But you get the sense of why I can use the data in a labeled case and the unlabeled case to come up with different ways of building the clusters.
So what we're going to do over the next 2 and 1/2 lectures is look at how can we write code to learn that way of separating things out?
That's the case where I don't know what the labels are, by simply trying to find ways to cluster things together nearby, and then use the clusters to assign labels to new data.
And we're going to learn models by looking at labeled data and seeing how do we best come up with a way of separating with a line or a plane or a collection of lines, examples from one group, from examples of the other group.
And as a consequence, we're going to have to make some trade-offs between what we call false positives and false negatives.
But the resulting classifier can then label any new data by just deciding where you are with respect to that separating line.
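As a rough sketch of what "deciding where you are with respect to that separating line" means in code, assume the learned line is stored as a slope and an intercept in a height and weight feature space. The line parameters, the labels, and the sample players below are hypothetical.

```python
def classify(point, slope, intercept, above_label, below_label):
    """Label a (height, weight) point by which side of the line it falls on."""
    height, weight = point
    boundary_weight = slope * height + intercept
    return above_label if weight > boundary_weight else below_label


# Hypothetical separating line: weight = 2.0 * height + 100.0
SLOPE, INTERCEPT = 2.0, 100.0

heavy_player = classify((75, 300), SLOPE, INTERCEPT, "lineman", "receiver")
light_player = classify((74, 200), SLOPE, INTERCEPT, "lineman", "receiver")
```

Once the line is known, labeling a new example costs one multiplication and one comparison; all of the effort is in learning the line.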
But I might have been better off to pick average speed or, I don't know, arm length, something else.
Starting next week, Professor Guttag is going to show you how you take those and actually start building more detailed versions of measuring clustering, measuring similarities to find an objective function that you want to minimize to decide what is the best cluster to use.
That quote, by the way, is from one of the great statisticians of the 20th century, which I think captures it well.
So feature engineering, as you, as a programmer, comes down to deciding both what are the features I want to measure in that vector that I'm going to put together, and how do I decide relative ways to weight it?
So John, and Ana, and I could have made our job this term really easy if we had sat down at the beginning of the term and said, you know, we've taught this course many times.
Let's just build a little learning algorithm that takes a set of data and predicts your final grade.
So I don't think the month in which you're born, the astrological sign under which you were born has probably anything to do with how well you'd program.
Now I could just throw all the features in and hope that the machine learning algorithm sorts out those it wants to keep from those it doesn't.
By the way, in case you're worried, I can assure you that Stu Schmill in the dean of admissions department does not use machine learning to pick you.
He actually looks at a whole bunch of things because it's not easy to replace him with a machine-- yet.
But from this example, I know that a cobra, it lays eggs, it has scales, it's poisonous, it's cold blooded, it has no legs, and it's a reptile.
But if I give you a second example, and it also happens to be egg-laying, have scales, poisonous, cold blooded, no legs.
Perfectly reasonable model, whether I design it or a machine learning algorithm would do it says, if all of these are true, label it as a reptile.
Let's make it a little more complicated-- has scales, cold blooded, 0 or four legs-- I'm going to say it's a reptile.
So it's an example outside of the cluster that says no scales, not cold blooded, but happens to have four legs.
On those features, there's no way to come up with a way that will correctly say that the python is a reptile and the salmon is not.
And probably my best thing is to simply go back to just two features, scales and cold blooded.
What that means is there's not going to be any instance of something that's a reptile that I'm going to call not a reptile.
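The two-feature rule can be written down directly. The feature values in the table below are assumptions made for this sketch, chosen to match the animals mentioned in the lecture.

```python
def is_reptile(has_scales, cold_blooded):
    """Two-feature rule: call it a reptile only if both features are true."""
    return has_scales and cold_blooded


# Assumed (has_scales, cold_blooded) values for a few animals.
animals = {
    "cobra": (True, True),
    "python": (True, True),
    "salmon": (True, True),
    "chicken": (True, False),
}

labels = {name: is_reptile(*features) for name, features in animals.items()}
```

Every reptile in this small table is labeled as one, but the salmon gets the same label as the python, since on these two features the two animals are indistinguishable.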
And if you think back to my example of the New England Patriots, that running back and that wide receiver were so close together in height and weight, there was no way I'm going to be able to separate them apart.
Because I want to use the distances to figure out either how to group things together or how to find a dividing line that separates things apart.
So one of the ways I could think about this is saying I've got four binary features and one integer feature associated with each animal.
And one way to learn to separate out reptiles from non reptiles is to measure the distance between pairs of examples and use that distance to decide what's near each other and what's not.
Given two vectors and a power, p, we basically take the absolute value of the difference between each of the components of the vector, raise it to the p-th power, take the sum, and take the p-th root of that.
If p is equal to 1, I just measure the absolute distance between each component, add them up, and that's my distance.
The one you've seen more, the one we saw last time, if p is equal to 2, this is Euclidean distance, right?
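A minimal sketch of that distance, written straight from the verbal description (the function name and the sample vectors are illustrative choices):

```python
def minkowski_distance(v1, v2, p):
    """Sum |a - b|**p over the components, then take the p-th root."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1 / p)


origin = [0, 0]
point = [3, 4]

manhattan = minkowski_distance(origin, point, 1)  # p = 1: add up the differences
euclidean = minkowski_distance(origin, point, 2)  # p = 2: straight-line distance
```

For these two points the Manhattan distance is 3 + 4 = 7, while the Euclidean distance is the length of the hypotenuse, 5.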
And to remind you, right, the alligator and the two snakes I would like to be close to one another and a distance away from the frog.
Why should I think that the difference in the number of legs or the number of legs difference is more important than whether it has scales or not?
And in fact, I'm not going to do it, but if I ran Manhattan metric on this, it would get the alligator much closer to the snakes, exactly because it differs only in two features, not three.
Whereas the dart frog, not as far away as it was before, but there's a pretty natural separation, especially using that number between them.
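The effect of feature scaling can be checked numerically. The feature vectors below are assumptions for this sketch, ordered as egg-laying, scales, poisonous, cold blooded, number of legs; they are meant to illustrate the argument rather than to reproduce the exact table used in class.

```python
import math

# Assumed vectors: [egg_laying, scales, poisonous, cold_blooded, legs]
rattlesnake = [1, 1, 1, 1, 0]
dart_frog = [1, 0, 1, 0, 4]
alligator = [1, 1, 0, 1, 4]


def euclidean(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def with_binary_legs(vector):
    """Replace the integer leg count with a 0/1 'has legs' feature."""
    return vector[:-1] + [1 if vector[-1] > 0 else 0]


# With an integer leg count, the legs dimension dominates the distance.
raw_to_snake = euclidean(alligator, rattlesnake)
raw_to_frog = euclidean(alligator, dart_frog)

# With a binary legs feature, every feature carries the same weight.
bin_to_snake = euclidean(with_binary_legs(alligator), with_binary_legs(rattlesnake))
bin_to_frog = euclidean(with_binary_legs(alligator), with_binary_legs(dart_frog))
```

Under these assumed vectors the alligator is nearer to the dart frog when legs are counted, and nearer to the rattlesnake once legs become a binary feature, because it then differs from the snake in two features and from the frog in three.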
And we already said we're going to look at two different kinds of learning, labeled and unlabeled, clustering and classifying.
And I want to learn what is the best way to come up with a rule that will let me take new examples and assign them to the right group.
There's a third way, which will lead to almost the same kind of result called k nearest neighbors.
And what I'm going to do is, for every new example, say find the k, say the five closest labeled examples.
If 3 out of 5 or 4 out of 5 or 5 out of 5 of those labels are the same, I'm going to say it's part of that group.
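A minimal sketch of that voting scheme, assuming Euclidean distance and a simple majority vote. The labeled points and the function name are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_label(new_point, labeled_examples, k=5):
    """Return the most common label among the k closest labeled examples."""
    nearest = sorted(
        labeled_examples,
        key=lambda example: math.dist(example[0], new_point),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples: ((feature_1, feature_2), label)
examples = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B"),
]

label_near_a = knn_label((2, 3), examples, k=5)
label_near_b = knn_label((8, 7), examples, k=5)
```

In both queries four of the five nearest neighbors share a label, so the vote is 4 out of 5 for that group.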
That dashed line has the property that on the right side you've got-- boy, I don't think this is deliberate, John, right-- but on the right side, you've got almost all Republicans.
And in this case, I get 12 true positives, 13 true negatives, and only 5 false positives.
So sensitivity is how many did I correctly label out of those that I both correctly labeled and incorrectly labeled as being negative: how many of them did I correctly label as being the kind that I want?
As I think about the machine learning algorithm I'm using and my choice of that classifier, I'm going to see a trade off where I can increase specificity at the cost of sensitivity or vice versa.
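The two measures can be computed from the four counts of a classifier's results. In the sketch below the true positive, true negative, and false positive counts are the ones quoted above; the false negative count is not stated in this passage, so the value used is an assumption made purely for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were labeled positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were labeled negative."""
    return true_neg / (true_neg + false_pos)


true_pos, true_neg, false_pos = 12, 13, 5
false_neg = 0  # ASSUMPTION: not given in the text, chosen for illustration

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
```

Moving the decision boundary so that fewer negatives are mislabeled would raise the specificity, typically at the price of missing some positives and lowering the sensitivity.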
As a team at a major US media company said to me a while ago: 'well, we know we can use ML to index ten years of video of our talent interviewing athletes - but what do we look for?’
This means that image sensors (and microphones) become a whole new input mechanism - less a ‘camera’ than a new, powerful and flexible sensor that generates a stream of (potentially) machine-readable data.
I met a company recently that supplies seats to the car industry, which has put a neural network on a cheap DSP chip with a cheap smartphone image sensor, to detect whether there’s a wrinkle in the fabric (we should expect all sorts of similar uses for machine learning in very small, cheap widgets, doing just one thing, as described here).
A ten year old could sort them into men and women, a fifteen year old into cool and uncool and an intern could say ‘this one’s really interesting’.
It might never get to the intern. But what would you do if you had a million fifteen year olds to look at your data? What calls would you listen to, what images would you look at, and what file transfers or credit card payments would you inspect?
Where this metaphor breaks down (as all metaphors do) is in the sense that in some fields, machine learning can not just find things we can already recognize, but find things that humans can’t recognize, or find levels of pattern, inference or implication that no ten year old (or 50 year old) would recognize.
That is, this is not so much a thousand interns as one intern that’s very very fast, and you give your intern 10 million images and they come back and say ‘it’s a funny thing, but when I looked at the third million images, this pattern really started coming out’.
Equally, the only reason we’re talking about autonomous cars and mixed reality is because machine learning (probably) enables them - ML offers a path for cars to work out what’s around them and what human drivers might be going to do, and offers mixed reality a way to work out what I should be seeing, if I’m looking through a pair of glasses that could show anything.
But after we’ve talked about wrinkles in fabric or sentiment analysis in the call center, these companies tend to sit back and ask, ‘well, what else?’ What are the other things that this will enable, and what are the unknown unknowns that it will find?
- On Monday, January 21, 2019
15: Functional Programming with Elm, ClojureScript, Om, and React
Episode 15 deep dives into the programming experiences of Adam Solove (@asolove), Head of Engineering at Pagemodo. Adam has spent the last ten years ...