# AI News, What’s the difference between machine learning, statistics, and data mining?

## What’s the difference between machine learning, statistics, and data mining?

Over the last few blog posts, I’ve discussed some of the basics of what machine learning is and why it’s important, including how to get started in machine learning in R.

Throughout those posts, I’ve been using the following definition of machine learning: creating computational systems that learn from data in order to make predictions and inferences.

And if you talk to someone who works in data mining, you’ll hear the same thing: data mining is about using data to make predictions and draw conclusions.

The long answer has a bit of nuance (which we’ll discuss soon), but the short answer is very simple: machine learning, statistical learning, and data mining are almost exactly the same.

With that in mind, let’s review the major similarities as well as the differences, both to prove the point that they really are extremely similar, and also to give you the more nuanced view.

If you examine the contents a bit more closely, you’ll see sections concerning linear regression, logistic regression, the bias-variance tradeoff, neural networks, and support vector machines.

So to summarize, the material covered in machine learning, statistics (and statistical learning in particular), and data mining is so similar that it’s nearly identical.

The best answer here is that even though they use the same methods, they evolved as different cultures, so they have different histories, nomenclature, notation, and philosophical perspectives.

Much like human twins that have different groups of friends, people in ML, stats, and data mining also have different and separate social groups: they exist in different academic departments (in most universities), typically publish in different journals, and have different conferences and events.

A different way of saying this is that although they use almost exactly the same methods, tools, and techniques, these three fields have philosophical differences concerning how and when those methods should be applied.

In his blog post, Wasserman mentioned these different emphases, noting that “statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low dimensional problems.”

(To be clear, in this quote, Wasserman seems to be talking about statistics broadly, and not statistical learning in particular.) Wasserman went on to note that machine learning is more focused on making accurate predictions – a sentiment echoed by Professor Andrew Gelman of Columbia University.

Actually, I’ll go a step further and state that machine learning isn’t just more focused on making predictions, but is more focused on building software systems that make predictions.

Historically, machine learning developed because computer scientists needed a way to create computer programs that learn from data.

In a mining operation – for example a gold mining operation – large piles of dirt and material are extracted from the mine and then the miners sift through the dirt to find nuggets of gold.

Professor Rob Tibshirani – one of the authors of the excellent book An Introduction to Statistical Learning – created a glossary comparing several major terms in machine learning vs statistics.

Although most people would consider Ng a member of the “machine learning culture,” he readily uses the terms attributed to the “statistical learning culture.”

There are some very rough generalizations we can make about tool choices between statisticians, data miners, and machine learning practitioners (but as I’ve pointed out several times, these are sort of hasty generalizations).

Moreover, as ML, stats, and data mining practitioners begin to cooperate and these fields begin to converge, you’re seeing people learn several tools.

In both academia and industry, you’ll find statisticians, ML experts, and data miners using a broad array of other technologies like SAS, SPSS, C++, Java, and others.



“They are … concerned with the same question: how do we learn from data?” To be clear, Wasserman was specifically addressing the difference between “machine learning” and “statistics” (he didn’t mention data mining).


This greater emphasis on systems (i.e., computer programs that learn from data) leads some people to argue that ML is more of an “engineering discipline” whereas statistics is more of a “mathematical” discipline.


So it appears that even though there are slight cultural differences, even those differences are insubstantial.


Wasserman noted in his blog post that in contrast to statistics, machine learning emphasizes “high dimension” prediction problems (which presumably means that machine learning emphasizes problems with a larger number of predictor variables).

Andrew Gelman, who is also a very well-respected professor of statistics, similarly implies that statistics emphasizes smaller-scale problems: “My impression is that computer scientists work on more complicated problems and bigger datasets, in general, than statisticians do.” And finally, data mining also emphasizes large-scale data.

While I won’t offer any quotes here, I can say from personal experience that people who identify as “data miners” commonly work with databases that have millions, even hundreds of millions or billions, of observations (although it’s quite common to subset these large datasets, take samples, etc.).

What this means for you, if you’re getting started with data science, is that you can safely treat machine learning, statistics, and data mining as “the same.” Learn the “surface level” differences so you can communicate with people from the “different cultures,” but ultimately, treat machine learning, statistics, and data mining each as subjects that you can learn from as you work to master data science.


## The Two Cultures: statistics vs. machine learning?

If somebody claims a particular estimator is an unbiased estimator for $\theta$, then we try many values of $\theta$ in turn, generate many samples from each based on some assumed model, push them through the estimator, and find the average estimated $\theta$.

If we can prove that the expected estimate equals the true value, for all values, then we say it's unbiased.'
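The simulation check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assumed model is a normal distribution with mean 0 and variance $\theta$, and the two estimators, the sample sizes, and the function names are all hypothetical choices rather than anything from the dialogue. One estimator divides by $n - 1$ and should average out to $\theta$; the other divides by $n$ and should come out systematically low.

```python
import random

def var_divide_by_n(xs):
    """Variance estimator that divides by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_divide_by_n_minus_1(xs):
    """Variance estimator that divides by n - 1 (Bessel's correction)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def average_estimate(estimator, theta, n_obs=5, n_trials=20000, seed=0):
    """Average the estimator over many samples from the assumed model N(0, theta)."""
    rng = random.Random(seed)
    sigma = theta ** 0.5
    total = 0.0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n_obs)]
        total += estimator(sample)
    return total / n_trials

# Try several true values of theta in turn, as the text describes.
for theta in (0.5, 1.0, 4.0):
    corrected = average_estimate(var_divide_by_n_minus_1, theta)
    uncorrected = average_estimate(var_divide_by_n, theta)
    print(f"theta={theta}: n-1 version -> {corrected:.3f}, n version -> {uncorrected:.3f}")
```

A simulation like this can only suggest that an estimator is unbiased for the values tried; the proof mentioned in the text is what covers all values.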

The empirical data you use might have all sorts of problems with it, and might not behave according to the model we agreed upon for evaluation.'

While your method might have worked on one dataset (the dataset with train and test data) that you used in your evaluation, I can prove that mine will always work.'

Your 'proof' is only valid if the entire dataset behaves according to the model you assumed.'

I'd love to step in and balance things up, perhaps demonstrating some other issues, but I really love watching my frequentist colleague squirm.'

Whereas I will do an evaluation that is more general (because it involves a broadly-applicable proof) and also more limited (because I don't know if your dataset is actually drawn from the modelling assumptions I use while designing my evaluation).'

ML: 'What evaluation do you use, B?'

Then we can use the idea that none of us care what's in the black box, we care only about different ways to evaluate.'

The frequentist will calculate these for each blood testing method that's under consideration and then recommend that we use the test that got the best pair of scores.'

They will want to know 'of those that get a Positive result, how many will get Sick?' and 'of those that get a Negative result, how many are Healthy?' '

ML: 'Ah yes, that seems like a better pair of questions to ask.'

One option is to run the tests on lots of people and just observe the relevant proportions.
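Observing the relevant proportions amounts to reading two different pairs of numbers off the same table of counts. The sketch below assumes the first pair of scores is conditioned on the true state (usually called sensitivity and specificity; the excerpt does not name them) and the second pair is conditioned on the test result, as in the two questions quoted above. The counts are hypothetical.

```python
def evaluate_test(tp, fp, fn, tn):
    """Summarize a diagnostic test from a 2x2 table of observed counts."""
    return {
        # Conditioned on the true state of the person
        "sensitivity": tp / (tp + fn),  # P(Positive | Sick)
        "specificity": tn / (tn + fp),  # P(Negative | Healthy)
        # Conditioned on the test result
        "ppv": tp / (tp + fp),          # P(Sick | Positive)
        "npv": tn / (tn + fn),          # P(Healthy | Negative)
    }

# Hypothetical counts: 1000 people tested, 10 of whom are Sick.
result = evaluate_test(tp=9, fp=99, fn=1, tn=891)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts both scores in the first pair are 0.9, yet fewer than one in ten people with a Positive result is Sick, which is why the choice of question matters.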

Your 'proven' coverage probabilities won't stack up in the real world unless all your assumptions stand up.

You call me crazy, yet you pretend your assumptions are the work of a conservative, solid, assumption-free analysis.'

But the interesting thing is that, once we decide on this form of evaluation, and once we choose our prior, we have an automatic 'recipe' to create an appropriate estimator.

If he wants an unbiased estimator for a complex model, he doesn't have any automated way to build a suitable estimator.'

I don't have an automatic way to create an unbiased estimator, because I think bias is a bad way to evaluate an estimator.

But given the conditional-on-data estimation that I like, and the prior, I can connect the prior and the likelihood to give me the estimator.'
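As a minimal sketch of that recipe, assume the simplest possible setting: a success rate estimated from yes/no outcomes, a Beta prior, and squared-error loss. None of these choices come from the dialogue; they are picked because the prior and likelihood then combine in closed form.

```python
def posterior_mean(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior.

    A Beta prior combined with a binomial likelihood gives a
    Beta(prior_a + successes, prior_b + failures) posterior, and its mean
    is the Bayes estimator under squared-error loss.
    """
    failures = trials - successes
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Same data, two hypothetical priors.
print(posterior_mean(7, 10))              # uniform prior, Beta(1, 1)
print(posterior_mean(7, 10, 10.0, 10.0))  # stronger prior centered on 0.5
```

Changing the prior changes the estimator automatically, which is the point B is making: the estimator is derived rather than designed by hand.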

We all have different ways to evaluate our methods, and we'll probably never agree on which methods are best.'

And some 'frequentist' proofs might be fun too, predicting the performance under some presumed model of data generation.'

Sometimes, you have great difficulty finding unbiased estimators, and even when you do you have a stupid estimator (for some really complex model) that will say the variance is negative.

ML: 'The lesson here is that, while we disagree a little on evaluation, none of us has a monopoly on how to create estimators that have properties we want.'

## The 10 Statistical Techniques Data Scientists Need to Master

Regardless of where you stand on the matter of Data Science sexiness, it’s simply impossible to ignore the continuing importance of data, and our ability to analyze, organize, and contextualize it.

With technologies like Machine Learning becoming ever more commonplace, and emerging fields like Deep Learning gaining significant traction amongst researchers and engineers (and the companies that hire them), Data Scientists continue to ride the crest of an incredible wave of innovation and technological progress.

As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” I personally know too many software engineers who are looking to transition into data science and blindly apply machine learning frameworks such as TensorFlow or Apache Spark to their data without a thorough understanding of the statistical theories behind them.

Having now been exposed to the content twice, I want to share the 10 statistical techniques from the book that I believe any data scientist should learn to be more effective in handling big datasets.

I wrote one of the most popular Medium posts on machine learning before, so I am confident I have the expertise to justify these differences.

In statistics, linear regression is a method to predict a target variable by fitting the best linear relationship between the dependent and independent variables.
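For a single predictor, the "best linear relationship" in the least squares sense has a closed form. The sketch below is only an illustration: the function name and the data are made up, and it covers simple (one-variable) regression only.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```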

Classification is a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.

Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

In Discriminant Analysis, 2 or more groups or clusters or populations are known a priori and 1 or more new observations are classified into 1 of the known populations based on the measured characteristics.

Discriminant analysis models the distribution of the predictors X separately in each of the response classes, and then uses Bayes’ theorem to flip these around into estimates for the probability of the response category given the value of X.
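A one-predictor sketch of that idea follows. It assumes each class's predictor values are normally distributed and gives every class its own variance, which is closer to quadratic discriminant analysis than to linear discriminant analysis with a shared variance. The class labels and data are hypothetical.

```python
import math

def fit_class(values):
    """Mean and variance of one class's predictor values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def posterior(x, classes):
    """P(class | x) from per-class densities and class frequencies, via Bayes' theorem."""
    total = sum(len(values) for values in classes.values())
    joint = {}
    for label, values in classes.items():
        mean, variance = fit_class(values)
        prior = len(values) / total                               # P(class)
        joint[label] = prior * gaussian_pdf(x, mean, variance)    # P(class) * P(x | class)
    evidence = sum(joint.values())                                # P(x)
    return {label: p / evidence for label, p in joint.items()}

# Hypothetical training data for two known groups.
classes = {
    "A": [1.0, 1.2, 0.8, 1.1, 0.9],
    "B": [3.0, 3.2, 2.8, 3.1, 2.9],
}
print(posterior(1.0, classes))  # a new observation near group A
print(posterior(2.0, classes))  # a new observation halfway between the groups
```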

In other words, resampling does not rely on generic distribution tables to compute approximate p-values (probability values).
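One common resampling method is the bootstrap. The sketch below, with hypothetical data and function names, builds a percentile interval for a mean by resampling the observed data with replacement instead of consulting a distribution table. It is an illustration of the idea, not a recommendation of this particular interval.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_interval(data, statistic, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on many samples drawn with replacement.
    estimates = sorted(
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

# Hypothetical measurements.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
low, high = bootstrap_interval(data, mean)
print(f"sample mean = {mean(data):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```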

In order to understand the concept of resampling, you should understand the terms Bootstrapping and Cross-Validation.

Usually for linear models, ordinary least squares is the main criterion used to fit them to the data.

This approach (shrinkage) fits a model involving all p predictors; however, the estimated coefficients are shrunken towards zero relative to the least squares estimates.
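The shrinking effect is easiest to see with one predictor and no intercept, where a ridge-style penalty has a closed form. This is a deliberately reduced sketch (the text talks about all p predictors); the penalty values and data are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Slope of a one-predictor, no-intercept ridge regression.

    Minimizes sum((y - b * x) ** 2) + penalty * b ** 2.
    With penalty = 0 this is the ordinary least squares slope.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Hypothetical centered data lying close to y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]
for penalty in (0.0, 1.0, 10.0, 100.0):
    print(f"penalty={penalty}: slope={ridge_slope(xs, ys, penalty):.3f}")
```

As the penalty grows, the slope moves from the least squares value towards zero but never reaches it exactly.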

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.

So far, we have only discussed supervised learning techniques, in which the groups are known and the experience provided to the algorithm is the relationship between actual entities and the group they belong to.

This was a basic run-down of some basic statistical techniques that can help a data science program manager and/or executive better understand what is running under the hood of their data science teams.

## Supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.[1] It infers a function from labeled training data consisting of a set of training examples.[2] In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias).

There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[4] Generally, there is a tradeoff between bias and variance.

A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).

The second issue is the amount of training data available relative to the complexity of the 'true' function (classifier or regression function).

If the true function is simple, then an 'inflexible' learning algorithm with high bias and low variance will be able to learn it from a small amount of data.

But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a 'flexible' learning algorithm with low bias and high variance.
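The tradeoff can be made concrete with a small simulation. The sketch below assumes a sine curve as the 'true' function and compares two hypothetical learners at one test point: an inflexible one that always predicts the mean training output, and a flexible one that copies the nearest training example. All names and settings are illustrative.

```python
import math
import random

def true_function(x):
    return math.sin(2 * math.pi * x)

def make_training_set(rng, n=20, noise=0.3):
    """Draw n noisy observations of the true function on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise)) for x in xs]

def predict_constant(train, x0):
    """Inflexible learner: always predicts the mean training output."""
    return sum(y for _, y in train) / len(train)

def predict_nearest(train, x0):
    """Flexible learner: copies the output of the nearest training input."""
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]

def bias_and_variance(predict, x0=0.25, n_trials=2000, seed=0):
    """Squared bias and variance of the prediction at x0 across fresh training sets."""
    rng = random.Random(seed)
    predictions = [predict(make_training_set(rng), x0) for _ in range(n_trials)]
    mean_prediction = sum(predictions) / n_trials
    squared_bias = (mean_prediction - true_function(x0)) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / n_trials
    return squared_bias, variance

for name, learner in [("constant", predict_constant), ("nearest", predict_nearest)]:
    squared_bias, variance = bias_and_variance(learner)
    print(f"{name}: squared bias = {squared_bias:.3f}, variance = {variance:.3f}")
```

Under these assumptions the constant learner shows high bias and low variance, and the nearest-example learner shows the reverse.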

If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features.

Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.

In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function.

In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones.

This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

A fourth issue is the degree of noise in the desired output values (the supervisory target variables).

If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples.

You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model.

In such a situation, the part of the target function that cannot be modeled 'corrupts' your training data - this phenomenon has been called deterministic noise.

In practice, there are several approaches to alleviate noise in the output values such as early stopping to prevent overfitting as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm.

Several algorithms identify noisy training examples, and removing the suspected noisy examples prior to training has decreased generalization error with statistical significance.[5][6]

Other factors also matter when choosing and applying a learning algorithm.

When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation).
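Comparing learning algorithms experimentally is often done with k-fold cross-validation. The sketch below is a minimal version under stated assumptions: the two candidate learners (a constant and a straight line), the synthetic data, and the function names are all hypothetical.

```python
import random

def k_fold_mse(fit, xs, ys, k=5, seed=0):
    """Mean squared error of `fit`, estimated by k-fold cross-validation.

    `fit(train_xs, train_ys)` must return a function mapping x to a prediction.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score the model only on examples it did not see during fitting.
        squared_errors.extend((model(xs[i]) - ys[i]) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)

def fit_constant(train_xs, train_ys):
    mean_y = sum(train_ys) / len(train_ys)
    return lambda x: mean_y

def fit_line(train_xs, train_ys):
    n = len(train_xs)
    mean_x = sum(train_xs) / n
    mean_y = sum(train_ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_xs, train_ys))
    sxx = sum((x - mean_x) ** 2 for x in train_xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Hypothetical data: a noisy straight line.
rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

scores = {
    "constant": k_fold_mse(fit_constant, xs, ys),
    "line": k_fold_mse(fit_line, xs, ys),
}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```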

Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

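Given a set of $N$ training examples $(x_1, y_1), \ldots, (x_N, y_N)$, where $x_i$ is the input and $y_i$ the desired output of the $i$-th example, many learning algorithms can be described through a real-valued scoring function $f(x, y) \in \mathbb{R}$, with the learned function returning the highest-scoring output, $g(x) = \arg\max_y f(x, y)$. Many algorithms are probabilistic: the learned function takes the form of a conditional probability model $P(y \mid x)$, or the scoring function takes the form of a joint probability model $P(x, y)$.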

For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

There are two basic approaches to choosing the learned function: empirical risk minimization and structural risk minimization.[7] Empirical risk minimization seeks the function that best fits the training data.

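In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs $(x_i, y_i)$. To measure how well a function fits the training data, a loss function $L$ with values in $\mathbb{R}^{\geq 0}$ is defined: for training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.

The risk of a function $g$ is its expected loss. This can be estimated from the training data as

$$R_{emp}(g) = \frac{1}{N} \sum_{i} L(y_i, g(x_i)),$$

where $N$ is the number of training examples.

In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes $R_{emp}(g)$. When $g$ is a conditional probability distribution $P(y \mid x)$ and the loss function is the negative log-likelihood, $L(y, \hat{y}) = -\log P(y \mid x)$, empirical risk minimization is equivalent to maximum likelihood estimation.

When the set of candidate functions is large or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization.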
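Empirical risk minimization can be sketched as "score every candidate function by its average training loss and keep the lowest." The example below assumes a tiny hypothetical setting: one numeric input, binary labels, 0-1 loss, and a handful of threshold rules as the candidate functions.

```python
def empirical_risk(predict, examples, loss):
    """Average loss of `predict` over the training examples."""
    return sum(loss(y, predict(x)) for x, y in examples) / len(examples)

def zero_one_loss(y, y_hat):
    """Loss of 1 for a wrong label, 0 for a correct one."""
    return 0.0 if y == y_hat else 1.0

def make_threshold_rule(t):
    """Candidate function: predict 1 when x >= t, otherwise 0."""
    return lambda x: 1 if x >= t else 0

# Hypothetical training pairs (x, y); the pair (1.2, 1) acts as a noisy label.
examples = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
            (2.5, 1), (3.0, 1), (3.5, 1), (1.2, 1)]
candidates = [0.0, 1.0, 2.25, 3.25]

risks = {
    t: empirical_risk(make_threshold_rule(t), examples, zero_one_loss)
    for t in candidates
}
best_threshold = min(risks, key=risks.get)
print(risks, "best threshold:", best_threshold)
```

With so few candidates there is little room to overfit; a much larger candidate set could chase the noisy example, which is the high-variance behavior the text warns about.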

The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

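For a model with weights $\beta_j$, a popular regularization penalty is the squared $L_2$ norm, $\sum_j \beta_j^2$. Other choices include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ "norm", which is the number of non-zero $\beta_j$.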
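A regularization penalty of this kind is just a function of the model's weights that gets added to the fit term. The sketch below shows three common penalties and a penalized objective; the weight vector, the risk value, and the `strength` parameter are hypothetical.

```python
def l2_penalty(weights):
    """Sum of squared weights (squared Euclidean norm)."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l0_penalty(weights):
    """Number of non-zero weights."""
    return sum(1 for w in weights if w != 0)

def regularized_risk(empirical_risk, weights, penalty, strength):
    """Fit term plus `strength` times the chosen penalty."""
    return empirical_risk + strength * penalty(weights)

# Hypothetical weight vector and empirical risk.
weights = [0.0, -3.0, 0.5, 0.0, 2.0]
for name, penalty in [("L2", l2_penalty), ("L1", l1_penalty), ("L0", l0_penalty)]:
    print(name, penalty(weights), regularized_risk(1.0, weights, penalty, 0.1))
```

A larger `strength` favors simpler functions more heavily, at the cost of a worse fit to the training data.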

The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values.
