Data Science, Machine Learning, and Statistics: what is in a name?

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered as parts of a larger end-to-end process.

The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques, and applying statistical criticism.

The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use.

group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers).

I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature.

There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets.

It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO).

Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because it is a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross section of search engine users as to their actual intent).

Good machine learning papers use good optimization techniques and bad machine learning papers (most of them in fact) use bad, out-of-date, ad hoc optimization techniques.

This isn’t meant to be a left-handed compliment: algorithms are a first love of mine and some of the matching algorithms bioinformaticians use (like online suffix trees) are quite brilliant.

variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.”

I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background.

Wikipedia defines data science as a field that “incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”

Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.

Machine learning

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to 'learn' (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.[1]

These analytical models allow researchers, data scientists, engineers, and analysts to 'produce reliable, repeatable decisions and results' and uncover 'hidden insights' through learning from historical relationships and trends in the data.[12]

Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.'[13]

Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[17]:708–710

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).

Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge.

Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by other supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples).

The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[19]

The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

An artificial neural network (ANN) learning algorithm, usually called 'neural network' (NN), is a learning algorithm that is vaguely inspired by biological neural networks.

They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.

Falling hardware prices and the development of GPUs for personal use in the last few years have contributed to the development of the concept of deep learning which consists of multiple hidden layers in an artificial neural network.

Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar.

Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters.

A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG).

Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing reconstruction of the inputs coming from the unknown data generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.

Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and uses methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem.

In 2006, the online movie company Netflix held the first 'Netflix Prize' competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.

Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into training and test sets (conventionally 2/3 for training and 1/3 for testing) and evaluates the performance of the trained model on the test set.

In comparison, the k-fold cross-validation method randomly splits the data into k subsets, where k - 1 of the subsets are used to train the model while the remaining kth subset is used to test the predictive ability of the trained model.
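The two validation schemes described here can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `holdout_split` and `k_fold_splits`, the fixed random seed, and the toy data are all assumptions made for the example.

```python
import random


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(rows, k, seed=0):
    """Yield (train, test) pairs; each of the k folds is the test set once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows into k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, folds[i]


data = list(range(12))  # hypothetical stand-in for 12 labeled examples
train, test = holdout_split(data)
print(len(train), len(test))
for train, test in k_fold_splits(data, k=4):
    print(len(train), len(test))
```

In practice a model would be fitted on each `train` list and scored on the matching `test` list, and the k scores from cross-validation would be averaged.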

For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[50][51]

There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these 'greed' biases are addressed.[53]

Glossary of common Machine Learning, Statistics and Data Science terms

When we get an output as probabilities and have to classify them into classes, we decide some threshold value and if the probability is above that threshold value we classify it as 1, and 0 otherwise.
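As a small sketch of this thresholding step (the function name `classify` and the default threshold of 0.5 are assumptions for illustration; the right threshold depends on the problem):

```python
def classify(probabilities, threshold=0.5):
    """Map each probability to 1 if it is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


print(classify([0.1, 0.5, 0.51, 0.9]))                 # [0, 0, 1, 1]
print(classify([0.1, 0.5, 0.51, 0.9], threshold=0.3))  # [0, 1, 1, 1]
```

Lowering the threshold turns more predictions into class 1, which trades false negatives for false positives.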

Computer Vision is a field of computer science that deals with enabling computers to visualize, process and identify images/videos in the same way that human vision does.

In recent times, the major driving forces behind Computer Vision have been the emergence of deep learning, the rise in computational power, and the availability of a huge amount of image data.

The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner.

The numbers of concordant and discordant pairs are used in calculations for Kendall’s tau, which measures the association between two ordinal variables.

The ranks given by reviewer 1 are ordered in ascending order; this way we can compare the rankings given by both reviewers.
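Counting concordant and discordant pairs can be sketched as below. The reviewer rankings are made-up illustration data, and the function computes the simple tau-a variant, which assumes there are no tied ranks.

```python
from itertools import combinations


def kendall_tau(ranks_a, ranks_b):
    """Return (concordant, discordant, tau-a) for two equal-length rankings."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(ranks_a, ranks_b), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:    # both reviewers order this pair the same way
            concordant += 1
        elif product < 0:  # the reviewers disagree on the order
            discordant += 1
    n_pairs = len(ranks_a) * (len(ranks_a) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


reviewer_1 = [1, 2, 3, 4, 5]  # hypothetical ranks, sorted ascending
reviewer_2 = [2, 1, 3, 5, 4]  # hypothetical ranks for the same items
print(kendall_tau(reviewer_1, reviewer_2))  # (8, 2, 0.6)
```

A tau of 1 means the two rankings agree on every pair, and a tau of -1 means they disagree on every pair.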

A confidence interval is used to estimate what percent of a population fits a category based on the results from a sample of that population.

For example, if 70 adults own a cell phone in a random sample of 100 adults, we can be fairly confident that the true percentage amongst the population is somewhere between 61% and 79%.
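The interval in this example can be reproduced with the normal approximation for a proportion. The text does not state the confidence level or the method, so the 95% level (z = 1.96) and the approximation itself are assumptions of this sketch.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 is about 95%)."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


low, high = proportion_confidence_interval(70, 100)
print(f"{low:.0%} to {high:.0%}")  # 61% to 79%
```

The margin shrinks with the square root of the sample size, so quadrupling the sample roughly halves the width of the interval.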

The 2nd quadrant is called type II error or False Negatives, whereas the 3rd quadrant is called type I error or False Positives.

An iterative algorithm is said to converge when, as the iterations proceed, the output gets closer and closer to a specific value.

A real-valued function is called convex if the line segment between any two points on the graph of the function lies above or on the graph.

Two parallel vectors have a cosine similarity of 1 and two vectors at 90° have a cosine similarity of 0.

Suppose we have two vectors A and B. The cosine similarity of these vectors can be calculated by dividing the dot product of A and B by the product of the magnitudes of the two vectors.
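That calculation translates directly into code. A minimal sketch follows; the function name and the example vectors are chosen for illustration, and the function assumes neither vector is all zeros (the similarity is undefined in that case).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors: approximately 1
print(cosine_similarity([1, 0], [0, 3]))  # vectors at 90 degrees: 0.0
```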

It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related.
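A short sketch of the sample covariance makes the sign behaviour concrete. The n - 1 denominator (sample rather than population covariance) and the toy data are assumptions of this example.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4]
print(covariance(xs, [2, 4, 6, 8]))  # positive: the variables rise together
print(covariance(xs, [8, 6, 4, 2]))  # negative: one falls as the other rises
print(covariance(xs, xs))            # covariance of a variable with itself is its variance
```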

In information theory, the cross entropy between two probability distributions over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural”

Early stopping is a technique for avoiding overfitting when training a machine learning model with an iterative method.

We configure early stopping so that model training stops once performance on the held-out validation set has stopped improving.

Early stopping enables you to specify a validation dataset and the number of iterations after which the algorithm should stop if the score on your validation dataset didn’t increase.
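The stopping rule can be sketched without any particular training library. In this illustration the list of validation scores stands in for scores that a real training loop would produce one iteration at a time, and the names `early_stopping` and `patience` are assumptions, not the API of a specific package.

```python
def early_stopping(validation_scores, patience=3):
    """Return (stop_iteration, best_iteration, best_score).

    Training stops once `patience` iterations have passed without the
    validation score improving on the best score seen so far.
    """
    best_score = float("-inf")
    best_iteration = -1
    for iteration, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= patience:
            return iteration, best_iteration, best_score
    return len(validation_scores) - 1, best_iteration, best_score


# Hypothetical validation scores, one per training iteration.
scores = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.80]
print(early_stopping(scores, patience=3))  # (5, 2, 0.75)
```

Note that with a patience of 3 the loop stops at iteration 5 and never reaches the later score of 0.80, which is the usual trade-off: a larger patience costs more training time but is less likely to stop too soon.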

EDA, or exploratory data analysis, is a phase of the data science pipeline in which the focus is to gain insight into the data through visualization or statistical analysis.

Enforces data quality and consistency standards. Delivers data in a presentation-ready format. This data can be used by application developers to build applications and by end users for making decisions.

The purpose of an evaluation metric is to measure the quality of a statistical or machine learning model.

Factor analysis is a technique that is used to reduce a large number of variables into a smaller number of factors.

Instead of looking up the indices in an associative array, it applies a hash function to the features and uses their hash values as indices directly.
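A minimal sketch of this hashing trick is shown below. The choice of SHA-256 as the hash function, the bucket count of 8, and the function name are assumptions made for the example; real implementations typically use a faster non-cryptographic hash and many more buckets.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Build a fixed-length count vector by hashing each token to an index."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # Use the hash value itself as the index: no vocabulary lookup needed.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector


print(hash_features(["red", "green", "red", "blue"]))
```

Because different tokens can land in the same bucket, collisions are possible; increasing `n_buckets` makes them less likely at the cost of a longer vector.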

Feature selection is a process of choosing those features which are required to explain the predictive power of a statistical model and dropping irrelevant features.

Few-shot learning refers to the training of machine learning algorithms using a very small set of training data instead of a very large set.

This is most suitable in the field of computer vision, where it is desirable to have an object categorization model work well without thousands of training examples.

It calculates the probability of an event in the long run of the experiment (i.e., the experiment is repeated under the same conditions to obtain the outcome).

For example, I perform an experiment with a stopping intention in mind: I will stop the experiment when it has been repeated 1000 times or I see a minimum of 300 heads in a coin toss.

IQR (or interquartile range) is a measure of variability based on dividing the rank-ordered data set into four equal parts.
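As a brief sketch, the interquartile range is the distance between the first and third quartile cut points. The example uses `statistics.quantiles` from the Python standard library (Python 3.8+) with its default method; other quartile conventions can give slightly different values, and the data is made up for illustration.

```python
import statistics


def interquartile_range(values):
    """Distance between the third and first quartile of the data."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


print(interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
```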

For example, each iteration of training a neural network takes a certain number of training examples and updates the weights by using gradient descent or some other weight update rule.

Labeled data are usually more expensive to obtain than raw unlabeled data because preparing labeled data involves manually labeling every piece of unlabeled data.

These charts are used to communicate information visually, such as to show an increase or decrease in the trend in data over intervals of time.

In the plot below, for each time instance, the speed trend is shown and the points are connected to display the trend over time.

Line charts can also be used to compare changes over the same period of time for multiple cases, like plotting the speed of a cycle, a car, and a train over time in the same plot.

Let us say, you ask a child in fifth grade to arrange people in his class by increasing order of weight, without asking them their weight!

He / she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters.

The coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.
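For a single predictor, this minimization has a closed-form solution. The sketch below assumes the line is written as y = a * x + b, with a as the slope and b as the intercept (the excerpt does not say which coefficient is which), and uses made-up data.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope minimizing the sum of squared vertical distances to the line.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```

The function assumes the x values are not all identical; otherwise the denominator is zero and no unique line exists.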

In simple words, it predicts the probability of occurrence of an event by fitting data to a logistic function. Hence, it is also known as logistic regression.
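The logistic function itself is short enough to sketch. In this illustration the coefficients `a` and `b` are hypothetical values chosen by hand; in a real model they would be estimated from data.

```python
import math


def logistic(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, a, b):
    """Probability of the event for input x under a fitted model a * x + b."""
    return logistic(a * x + b)


for x in [0.0, 2.0, 4.0]:
    # a=1.5 and b=-3.0 are hypothetical coefficients, not fitted values.
    print(x, round(predict_probability(x, a=1.5, b=-3.0), 3))
```

With these coefficients the predicted probability is exactly 0.5 at x = 2.0, below 0.5 for smaller x, and above 0.5 for larger x.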

artificial neuron, as in a multi-layer neural network, that is, they compute an activation (using an activation function) of a weighted sum.

Naive Bayes is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

It can accommodate a wide variety of data models, including key-value, document, columnar and graph formats.

For example, if you have one variable ranging from 0 to 1 and another ranging from 0 to 1000, you can normalize the variables so that both are in the range 0 to 1.
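One common way to do this is min-max scaling, sketched below. The function name and the handling of a constant column (mapped to all zeros) are choices made for this example.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


print(min_max_normalize([0, 250, 500, 1000]))  # [0.0, 0.25, 0.5, 1.0]
```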

Pandas is an open source, high-performance, easy-to-use data structure and data analysis library for the Python programming language.

Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.

In computer vision, supervised pattern recognition techniques are used for optical character recognition (OCR), face detection, face recognition, object detection, and object classification.

You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify images in those pictures.

Principal component analysis (PCA) is an approach to factor analysis that considers the total variance in the data, and transforms the original variables into a smaller set of linear combinations.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Python is an open source programming language, widely used for various applications, such as general purpose programming, data science and machine learning.

R is an open-source programming language and a software environment for statistical computing, machine learning, and data visualization.

Recommendation engines are essentially data filtering tools that make use of algorithms and data to recommend the most relevant items to a particular user.

If we can recommend items to a customer based on their needs and interests, it will create a positive effect on the user experience and they will visit more frequently.

In this technique, instead of building one model for the entire dataset, the dataset is divided into multiple bins and a separate model is built on each bin.

Reinforcement learning is an example of machine learning where the machine is trained to take specific decisions based on the business requirement, with the sole aim of maximizing efficiency (performance).

The idea involved in reinforcement learning is: the machine or software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems.

An RL agent learns from its past experience, or rather from its continual trial-and-error learning process, as opposed to supervised learning, where an external supervisor provides examples.

Self-driving cars use reinforcement learning to make decisions continuously: which route to take and what speed to drive at are some of the questions that are decided after interacting with the environment.

α (alpha) is the parameter which balances the amount of emphasis given to minimizing RSS vs minimizing sum of squares of coefficients.
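The balance that alpha controls can be made concrete by writing out the objective as RSS plus alpha times the sum of squared coefficients. This is a sketch of the objective only, not a solver; the function name, the separate unpenalized intercept, and the toy data are assumptions of the example.

```python
def ridge_objective(rows, targets, intercept, coefficients, alpha):
    """RSS plus alpha times the sum of squared coefficients."""
    rss = 0.0
    for row, target in zip(rows, targets):
        prediction = intercept + sum(c * x for c, x in zip(coefficients, row))
        rss += (target - prediction) ** 2
    penalty = sum(c * c for c in coefficients)
    return rss + alpha * penalty


rows = [[1.0], [2.0], [3.0]]  # one hypothetical feature per row
targets = [2.0, 4.0, 6.0]
print(ridge_objective(rows, targets, 0.0, [2.0], alpha=0.0))  # 0.0
print(ridge_objective(rows, targets, 0.0, [2.0], alpha=0.5))  # 2.0
```

With alpha set to 0 the objective is just the RSS; as alpha grows, large coefficients are penalized more heavily, so the minimizing coefficients shrink toward zero.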

In mathematics, a function defined on an inner product space is said to have rotational invariance if its value does not change when arbitrary rotations are applied to its argument.

It is used for building machine learning models for a range of tasks in data science, mainly for machine learning applications such as building neural networks.

Find the mean of the set of numbers, which is (600 + 470 + 170 + 430 + 300) / 5 = 394.

Mathematical Problems in Engineering

Recently, a number of short-term speed prediction approaches have been developed, in which most algorithms are based on machine learning and statistical theory.

This paper examined the multistep ahead prediction performance of eight different models using the 2-minute travel speed data collected from three Remote Traffic Microwave Sensors located on a southbound segment of 4th ring road in Beijing City.

Difference between Machine Learning & Statistical Modeling

One of the most common questions asked on various data science forums is: What is the difference between Machine Learning and Statistical modeling?

When I first came across this question, I found almost no clear answer that lays out how machine learning is different from statistical modeling.

Given the similarity in the objective both try to solve for, the only differences lie in the volume of data involved and the human involvement required to build a model.

Here is an interesting Venn diagram on the coverage of machine learning and statistical modeling in the universe of data science (Reference: SAS institute)

The common objective behind using either of the tools is Learning from Data. Both these approaches aim to learn about the underlying phenomena by using data generated in the process.

Let us now see an interesting example published by McKinsey differentiating the two approaches.

Case: Understand the risk level of customer churn over a period of time for a telecom company.

Data Available: Two Drivers –

Even with a laptop with 16 GB of RAM, I work daily on datasets of millions of rows with thousands of parameters and build an entire model in not more than 30 minutes.

Given the flavor of difference in the output of these two approaches, let us understand the difference between the two paradigms, even though both do almost the same job.

All the differences mentioned above do separate the two to some extent, but there is no hard boundary between Machine Learning and statistical modeling.

Machine learning is a subfield of computer science and artificial intelligence which deals with building systems that can learn from data, instead of explicitly programmed instructions.

It came into existence in the 1990s as steady advances in digitization and cheap computing power enabled data scientists to stop building finished models and instead train computers to do so.

The unmanageable volume and complexity of the big data that the world is now swimming in have increased the potential of machine learning—and the need for it.

In a statistical model, we basically try to estimate the function f. Machine learning takes away the deterministic function “f”.

It will try to find pockets of X in n dimensions (where n is the number of attributes) where the occurrence of Y is significantly different.

Statistical modeling and machine learning

This lecture for students in bioinformatics, physics, master students of Data Engineering and Analytics, and master students of Biomedical Computing will lay the theoretical and practical foundations for statistical data analysis, statistical modeling and machine learning from a Bayesian probabilistic perspective.

It will train you in "thinking probabilistically", which is extremely useful in many areas of quantitative sciences where noisy data need to be analyzed (and hence modeled!).

The lecture will be held in inverted classroom style: Each week, we will give a ~30 min overview of the next reading assignment of a section of the book, pointing out the essential messages, thus facilitating the reading at home.

Then, practical exercises using the newly acquired material will be solved in teams, using the R statistics framework (100 min).

The inverted classroom style is in our experience better suited than the conventional lecturing model for quantitative topics that require the students to think through or retrace mathematical derivations at their own speed.

Math Basic linear algebra (matrix and vector algebra, inverse and transposed matrices, determinants, eigenvalue decomposition) and one-dimensional calculus (e.g.

Probability theory, Bayes theorem, conditional independence, distributions (multinomial, Poisson, Gaussian, gamma, beta,...), central limit theorem, entropy, mutual information

Bayesian statistics: max posterior estimation, model selection, uninformative and robust priors, hierarchical and empirical Bayes, Bayesian decision theory

StatQuest: What is a statistical model?

"Model" is a vague term that means different things in different contexts. Here I clear it all up in the context of statistics! If you'd like to support StatQuest, please ...

Statistics 03: Types of statistical models

In this lecture, I show which types of statistical models should be used when; the most important decision concerns the explanatory variables: When these are ...

ROC Curves and Area Under the Curve (AUC) Explained

Statistics 101: Simple Linear Regression, The Very Basics

This is the first video in what will be, or is (depending on when you are watching this) a multipart video series about Simple Linear Regression. In the next few ...

Approaches for Sequence Classification on Financial Time Series Data

Sequence classification tasks can be solved in a number of ways, including both traditional ML and deep learning methods. Catch Lauren Tran's talk at the ...


This video is part of the Udacity course "Machine Learning for Trading".

11. Introduction to Machine Learning

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016. Instructor: Eric Grimson

Machine Learning for Time Series Data in Python | SciPy 2016 | Brett Naul

The analysis of time series data is a fundamental part of many scientific disciplines, but there are few resources meant to help domain scientists to easily explore ...

A friendly introduction to Bayes Theorem and Hidden Markov Models

A friendly introduction to Bayes Theorem and Hidden Markov Models, with simple examples. No background knowledge needed, except basic probability.

Choosing which statistical test to use - statistics help

Seven different statistical tests and a process by which you can decide which to use. The tests are: Test for a mean, test for a proportion, difference of proportions ...