AI News, Machine Learning (Theory)

Machine Learning (Theory)

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a data matrix), and want to learn a good linear predictor in a reasonable amount of time.

As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters, but that's not the case: there are around 16 billion examples and 16 million parameters, and people really care about a high-quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single-machine learning algorithm.

The proof of this is simple: we can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

This approach is also great in terms of the amount of incremental learning required: you just need to learn one function (AllReduce) to be able to create useful parallel machine learning algorithms.

Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s).
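
The core primitive behind this approach is easy to sketch. Below is a minimal, single-process simulation of the idea: each hypothetical worker computes a gradient on its shard of the data, an allreduce-style sum makes the aggregate available to everyone, and every worker applies the same update. The shard count, learning rate, and helper names are illustrative assumptions, not the actual Vowpal Wabbit implementation.

```python
import numpy as np

def allreduce_sum(values):
    """Toy stand-in for AllReduce: every participant ends up with the sum."""
    total = np.sum(values, axis=0)
    return [total.copy() for _ in values]

def simulate_round(shards, w, lr=0.1):
    """One synchronous round: local gradients, allreduce, identical update."""
    grads = []
    for X, y in shards:
        # squared-loss gradient of a linear predictor on the local shard
        grads.append(X.T @ (X @ w - y) / len(y))
    summed = allreduce_sum(grads)
    # every worker now holds the same averaged gradient and takes the same step
    return w - lr * summed[0] / len(shards)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
true_w = rng.normal(size=16)
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 hypothetical workers

w = np.zeros(16)
for _ in range(200):
    w = simulate_round(shards, w)
print("parameter error:", np.linalg.norm(w - true_w))
```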

Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter.

Machine Learning (Theory)

In the last 7 years or so there has been quite a bit of work on parallel machine learning approaches, enough that I felt like a summary might be helpful both for myself and others.

For many algorithms, this can provide an easy 10x speedup, with the limits being programming (GPUs are special), the amount of GPU RAM (12GB for a K40), the bandwidth to the GPU interface, and your algorithms needing care as new architectures come out.

I personally would rate it somewhat higher, just because debugging is such an intrinsic part of using machine learning algorithms and the debuggability of nondeterministic algorithms is greatly impaired.

Controlling machine-learning algorithms and their biases

This often-overlooked defect can trigger costly errors and, left unchecked, can pull projects and organizations in entirely wrong directions.

Effective efforts to confront this problem at the outset will repay handsomely, allowing the true potential of machine learning to be realized most efficiently.

In the domain of artificial intelligence, machine learning increasingly refers to computer-aided decision making based on statistical algorithms generating data-driven insights (see sidebar, “Machine learning: The principal approach to realizing the promise of artificial intelligence”).

To create a functioning statistical algorithm by means of a logistic regression, for example, missing variables must be replaced by assumed numeric values (a process called imputation).
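
As a concrete illustration of imputation feeding a logistic regression, here is a minimal scikit-learn sketch; the toy feature values and the choice of mean imputation are assumptions for illustration only, not a recommendation from the article.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical applicant data: income, age; np.nan marks missing values.
X = np.array([
    [52000, 34],
    [np.nan, 45],    # missing income must be imputed before fitting
    [61000, np.nan],
    [30000, 29],
])
y = np.array([1, 0, 1, 0])  # hypothetical approve / decline labels

# Replace missing entries with the column mean, then fit a logistic regression.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X, y)
print(model.predict_proba(X)[:, 1])
```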

Machine learning is able to manage vast amounts of data and detect many more complex patterns within them, often attaining superior predictive power.

With access to the right data and guidance by subject-matter experts, predictive machine-learning models could find the hidden patterns in the data and correct for such spikes.

Confirmation bias is the tendency to select evidence that supports preconceived beliefs, while loss-aversion bias imposes undue conservatism on decision-making processes.

Machine learning is being used in many decisions with business implications, such as loan approvals in banking, and with personal implications, such as diagnostic decisions in hospital emergency rooms.

Where machine learning predicts behavioral outcomes, the necessary reliance on historical criteria will reinforce past biases, including stability bias.

Just as a traumatic childhood accident can cause lasting behavioral distortion in adults, so can unrepresentative events cause machine-learning algorithms to go off course.

Should a series of extraordinary weather events or fraudulent actions trigger spikes in default rates, for example, credit scorecards could brand a region as “high risk.”

Companies seeking to overcome biases with statistical decision-making processes may find that the data scientists supervising their machine-learning algorithms are subject to these same biases.

It is frustratingly difficult to shape machine-learning algorithms to recognize a pattern that is not present in the data, even one that human analysts know is likely to manifest at some point.

Since machine-learning algorithms try to capture patterns at a very detailed level, however, every attribute of each synthetic data point would have to be crafted with utmost care.

In 2007, an economist with an inkling that credit-card defaults and home prices were linked would have been unable to build a predictive model showing this relationship, since it had not yet appeared in the data.

As described in a previous article in McKinsey on Risk, companies can take measures to eliminate bias or protect against its damaging effects in human decision making.

First, users of machine-learning algorithms need to understand an algorithm’s shortcomings and refrain from asking questions whose answers will be invalidated by algorithmic bias.

They must understand the true values involved in the trade-off: algorithms offer speed and convenience, while manually crafted models, such as decision trees or logistic regression (or, for that matter, human decision making), offer more flexibility and transparency.

Health-conscious consumers must study literature on nutrition and read labels in order to avoid excess calories, harmful additives, or dangerous allergens.

In credit scoring, for example, built-in stability bias prevents machine-learning algorithms from accounting for certain rapid behavioral shifts in applicants.

Burdened by an exceptionally high monthly installment (due to the short tenor), many of these applicants will ultimately default, causing a spike in credit losses.

Should business users fail to recognize these shifts, banks might be able to identify them indirectly, by monitoring the distribution of monthly applications by loan tenor.
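
One simple way to do that monitoring is a monthly tabulation of application shares by tenor; the column names and the tiny sample below are hypothetical, just to show the shape of the check.

```python
import pandas as pd

# Hypothetical application log: one row per loan application.
apps = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-02"],
    "tenor_months": [12, 36, 12, 12, 60],
})

# Share of applications per tenor, by month; a sudden shift toward short
# tenors would show up as a change in these proportions.
dist = pd.crosstab(apps["month"], apps["tenor_months"], normalize="index")
print(dist)
```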

The challenge here is to establish whether a marked shift is due to a deliberate change in behavior by applicants or to other factors, such as changes in economic conditions or a bank’s promotional strategy.

Tests can ensure that unwanted biases of past human decision makers, such as gender biases, for example, have not been inadvertently baked into machine-learning algorithms.
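
A minimal sketch of such a test, assuming a scored held-out set with a sensitive attribute recorded for testing purposes only, is to compare approval rates across groups; the threshold, data, and names here are illustrative assumptions.

```python
import numpy as np

def approval_rate_by_group(scores, groups, threshold=0.5):
    """Approval rate per group for a scored test set (hypothetical names)."""
    approved = scores >= threshold
    return {g: approved[groups == g].mean() for g in np.unique(groups)}

# Hypothetical held-out model scores and the recorded sensitive attribute.
scores = np.array([0.8, 0.4, 0.7, 0.2, 0.9, 0.3])
gender = np.array(["F", "F", "M", "M", "F", "M"])

print(approval_rate_by_group(scores, gender))
# A large gap between the groups' approval rates would warrant investigation.
```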

Experts with deep machine-learning knowledge and good business judgment are like experienced gardeners, carefully nurturing the plants to encourage their organic growth.

By using stratified sampling and optimized observation weights, data scientists ensure that the algorithm is most powerful for those decisions in which the business impact of a prediction error is the greatest.
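
A sketch of both levers with scikit-learn, under assumed data and assumed weights (errors on the positive class treated as five times costlier), might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Stratified split keeps the class mix identical in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical observation weights: misclassifying the positive class costs 5x more.
weights = np.where(y_tr == 1, 5.0, 1.0)
clf = LogisticRegression().fit(X_tr, y_tr, sample_weight=weights)
print("test accuracy:", clf.score(X_te, y_te))
```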

Traditional approaches include human decision making or handcrafted models such as decision trees or logistic-regression models—the analytic workhorses used for decades in business and the public sector to assign probabilities to outcomes.

Three questions can be considered when deciding to use machine-learning algorithms. In addition to these considerations, companies implementing large-scale machine-learning programs should make appropriate organizational and cultural changes to support them.

While not as stringent and formal, the approach is related to mature model development and validation processes by which large institutions are gaining strategic control of model proliferation and risk.

Three building blocks are critically important for implementation. Creating a conscious, standards-based system for developing machine-learning algorithms will involve leaders in many judgment-based decisions.

exercise designed to pinpoint the limitations of a proposed model and help executives judge the business risks involved in a new algorithm.

Modern Machine Learning Algorithms: Strengths and Weaknesses

In this guide, we’ll take a practical, concise tour through modern machine learning algorithms.

For example, Scikit-Learn’s documentation page groups algorithms by their learning mechanism, producing categories such as linear models, support vector machines, nearest neighbors, and decision trees. However, from our experience, this isn’t always the most practical way to group algorithms.

That’s because for applied machine learning, you’re usually not thinking, “boy do I want to train a support vector machine today!”

Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in.

As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn't bust out a shovel and start digging.

In Part 2, we will cover dimensionality reduction. Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores.

Decision trees learn in a hierarchical fashion by repeatedly splitting your dataset into separate branches that maximize the information gain of each split.
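
To make “maximize the information gain of each split” concrete, here is a small, self-contained sketch of scoring one candidate split; entropy-based gain is one common choice, used here as an assumption.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Entropy reduction from splitting labels y into left/right by a boolean mask."""
    left, right = y[left_mask], y[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
x = np.array([1.0, 1.2, 1.4, 3.1, 3.3, 3.6, 3.8, 1.1])
# A tree would try many thresholds and keep the one with the highest gain.
print(information_gain(y, x < 2.0))
```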

We won't go into their underlying mechanics here, but in practice, RFs often perform very well out of the box, while GBMs are harder to tune but tend to have higher performance ceilings.
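
A minimal comparison of the two ensembles with scikit-learn defaults on a synthetic dataset is sketched below; the dataset and any resulting numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random forest: usually strong with default settings.
rf = RandomForestClassifier(random_state=0)
# Gradient boosting: more knobs to tune (learning_rate, n_estimators, max_depth).
gbm = GradientBoostingClassifier(random_state=0)

for name, model in [("RF", rf), ("GBM", gbm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean().round(3))
```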

They use 'hidden layers' between inputs and outputs in order to model intermediary representations of the data that other algorithms cannot easily learn.

However, deep learning still requires much more data to train compared to other algorithms because the models have orders of magnitude more parameters to estimate.

These algorithms are memory-intensive, perform poorly for high-dimensional data, and require a meaningful distance function to calculate similarity.
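
The memory cost and the need for a distance function follow directly from how nearest-neighbor prediction works: the training set itself is the model. A small scikit-learn sketch with an assumed Euclidean metric and toy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.5], [7.9, 8.1]])
y_train = np.array([0, 0, 1, 1])

# The whole training set is stored; prediction scans for the closest points
# under the chosen distance metric (Euclidean here).
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[7.5, 8.0]]))  # nearest neighbors are mostly class-1 points
```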

Examples include predicting employee churn, email spam, financial fraud, or student letter grades.

Predictions are mapped to be between 0 and 1 through the logistic function, which means that predictions can be interpreted as class probabilities.

The models themselves are still 'linear,' so they work well when your classes are linearly separable (i.e. they can be separated by a single decision surface).
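
The mapping from a linear score to a probability is just the logistic (sigmoid) function; here is a minimal sketch with assumed, hand-picked coefficients rather than fitted ones.

```python
import numpy as np

def logistic(z):
    """Squash a real-valued linear score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two features plus an intercept.
w = np.array([1.5, -0.8])
b = -0.2
x = np.array([0.6, 1.1])

score = w @ x + b        # linear part: a single decision surface
prob = logistic(score)   # interpreted as the probability of the positive class
print(round(prob, 3))
```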

To predict a new observation, you'd simply 'look up' the class probabilities in your 'probability table' based on its feature values.
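
The “probability table” intuition can be sketched directly: counts from training data become conditional probabilities, and a new observation is scored by multiplying the entries it looks up (smoothing is omitted for brevity). The features and labels below are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny categorical training set: (weather, weekend?) -> label.
data = [(("sunny", "yes"), 1), (("sunny", "no"), 1),
        (("rainy", "yes"), 0), (("rainy", "no"), 0), (("sunny", "yes"), 1)]

class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(label, i)][value] += 1

def score(features, label):
    """P(label) times the product of looked-up P(feature_i = value | label)."""
    p = class_counts[label] / len(data)
    for i, value in enumerate(features):
        p *= feature_counts[(label, i)][value] / class_counts[label]
    return p

new = ("sunny", "no")
print({label: score(new, label) for label in class_counts})
```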

However, we want to leave you with a few words of advice based on our experience. If you'd like to learn more about the applied machine learning workflow and how to efficiently train professional-grade models, we invite you to check out our Data Science Primer.

For more over-the-shoulder guidance, we also offer a comprehensive masterclass that further explains the intuition behind many of these algorithms and teaches you how to apply them to real-world problems.

Text Analytics - Ep. 25 (Deep Learning SIMPLIFIED)

Unstructured textual data is ubiquitous, but standard Natural Language Processing (NLP) techniques are often insufficient to analyze this data properly.

Time Limit Exceeded (TLE) - Learn this trick to pass all testcases in Competitive Coding !

Frustrated with a TLE error? Watch this video to learn why an online judge throws TLE and how to write algorithms that fit the given constraints ...

Lecture 14 | Deep Reinforcement Learning

In Lecture 14 we move from supervised learning to reinforcement learning (RL), in which an agent must learn to interact with an environment in order to ...

11. Introduction to Machine Learning

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016. View the complete course: Instructor: Eric Grimson ...

Lecture 13 | Generative Models

In Lecture 13 we move beyond supervised learning, and discuss generative modeling as a form of unsupervised learning. We cover the autoregressive ...

Jens Ludwig: "Machine Learning in the Criminal Justice System" | Talks at Google

Jens Ludwig, Director of the University of Chicago Crime Lab, talks about applying machine learning to reducing crime in Chicago and other public policy areas.

Man vs Machine Learning: Criminal Justice in the 21st Century | Jens Ludwig | TEDxPennsylvaniaAvenue

At any point in time America has over 700,000 people in jail, drawn disproportionately from low-income and minority groups. We require judges to make ...

Lecture 15 | Efficient Methods and Hardware for Deep Learning

In Lecture 15, guest lecturer Song Han discusses algorithms and specialized hardware that can be used to accelerate training and inference of deep learning ...

Post-Quantum Zero-Knowledge and Signatures from Symmetric-Key

We propose a new class of post-quantum digital signature schemes that: (a) derive their security entirely from the security of symmetric-key primitives, believed ...

Practical Learning Algorithms for Structured Prediction

Machine learning techniques have been widely applied in many areas. In many cases, high accuracy requires training on large amount of data, adding more ...