AI News

No, Machine Learning is not just glorified Statistics

This meme has been all over social media lately, producing appreciative chuckles across the internet as the hype around deep learning begins to subside.

ML experts who in 2013 preached deep learning from the rooftops now use the term only with a hint of chagrin, preferring instead to downplay the power of modern neural networks lest they be associated with the scores of people that still seem to think that import keras is the leap for every hurdle, and that they, in knowing it, have some tremendous advantage over their competition.

While it’s true that deep learning has outlived its usefulness as a buzzword, as Yann LeCun put it, this overcorrection of attitudes has yielded an unhealthy skepticism about the progress, future, and usefulness of artificial intelligence.

Additionally, many models approximate what can generally be considered statistical functions: the softmax output layer of a classification model turns logits into class probabilities, making the final step of training an image classifier essentially a multinomial logistic regression.
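
To make that concrete, here is a minimal numpy sketch (hypothetical layer sizes, randomly initialized weights) of that final classification layer: a linear map producing logits, followed by a softmax that turns them into class probabilities, which is exactly multinomial logistic regression on the learned features.

```python
# Minimal sketch (illustrative only): the classification "head" of an image
# classifier is a linear map plus softmax, i.e. multinomial logistic regression.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 512))     # hypothetical penultimate-layer activations
W = rng.normal(size=(512, 10)) * 0.01    # weights of the final layer
b = np.zeros(10)

logits = features @ W + b                # raw scores (logits)
probs = softmax(logits)                  # class probabilities, each row sums to 1
print(probs.sum(axis=1))
```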

However, in order to correctly evaluate the powerful impact and potential of machine learning methods, it is important to first dismantle the misguided notion that modern developments in artificial intelligence are nothing more than age-old statistical techniques with bigger computers and better datasets.

When I was learning the ropes of machine learning, I was lucky enough to take a fantastic class dedicated to deep learning techniques that was offered as part of my undergraduate computer science program.

Yet, I was able to read and understand a paper on a state-of-the-art generative machine learning model, implement it from scratch, and generate quite convincing fake images of non-existent individuals by training it on the MS Celebs dataset.

Throughout the class, my fellow students and I successfully trained models for cancerous tissue image segmentation, neural machine translation, character-based text generation, and image style transfer, all of which employed cutting-edge machine learning techniques invented only in the past few years.

Information theory, in general, requires a strong understanding of data and probability, and I would certainly advise anyone interested in becoming a Data Scientist or Machine Learning Engineer to develop a deep intuition of statistical concepts.

It should also be acknowledged that many machine learning algorithms require a stronger background in statistics and probability than do most neural network techniques, but even these approaches are often referred to as statistical machine learning or statistical learning, as if to distinguish themselves from the regular, less statistical kind.

Again, in the real world, anyone hoping to do cool machine learning stuff is probably working on data problems of a variety of types, and therefore needs to have a strong understanding of statistics as well.

In neural networks, this usually means using some variant of stochastic gradient descent to update the weights and biases of your network according to some defined loss function.
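
As a rough illustration (a toy linear model in plain numpy rather than a real network or any particular framework's API), one pass of that update rule looks like this: sample a mini-batch, compute the gradient of the loss, and step the weights and bias against it.

```python
# Illustrative SGD sketch: repeatedly sample a mini-batch, compute the gradient
# of a mean-squared-error loss, and nudge the parameters against it.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                 # toy inputs
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=256)

w, b = np.zeros(3), 0.0                       # weights and bias to be learned
lr = 0.1                                      # learning rate

for step in range(300):
    idx = rng.choice(len(X), size=32)         # mini-batch of examples
    xb, yb = X[idx], y[idx]
    err = xb @ w + b - yb                     # prediction error on the batch
    grad_w = 2 * xb.T @ err / len(xb)         # gradient of mean squared error
    grad_b = 2 * err.mean()
    w -= lr * grad_w                          # SGD update
    b -= lr * grad_b

print(w, b)                                   # should approach [1.5, -2.0, 0.5] and 3.0
```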

Borrowing statistical terms like logistic regression does give us useful vocabulary to discuss our model space, but it does not redefine these models from problems of optimization into problems of data understanding.

If you don’t believe me, try telling a statistician that your model was overfitting, and ask them if they think it’s a good idea to randomly drop half of your model’s 100 million parameters.
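
For context, the technique being alluded to is dropout, which randomly zeroes a fraction of a layer's units during training; a minimal numpy sketch (shapes and keep probability are purely illustrative) looks like this:

```python
# Illustrative dropout sketch: randomly zero roughly half the activations during
# training and rescale, so the expected activation matches test-time behavior.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)        # "inverted dropout" rescaling

h = np.ones((2, 8))           # hypothetical hidden-layer activations
print(dropout(h, p=0.5))      # roughly half the entries are zeroed out
```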

This has yielded considerable progress in fields such as computer vision, natural language processing, and speech transcription, and it has enabled huge improvements in technologies like face recognition, autonomous vehicles, and conversational AI.

It’s also true that the space shuttle was ultimately just a flying machine with wings, and yet we don’t see memes mocking the excitement around NASA’s 20th century space exploration as an overhyped rebranding of the airplane.

Machine learning

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to 'learn' (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed.[2]

These analytical models allow researchers, data scientists, engineers, and analysts to 'produce reliable, repeatable decisions and results' and uncover 'hidden insights' through learning from historical relationships and trends in the data.[9]

Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.'[10]

Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[14]:708–710

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).

Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge.

Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by other supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples).
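
To make that concrete, here is a minimal sketch of one such loss, the cross-entropy commonly used for classification, computed on a couple of hand-made examples:

```python
# Cross-entropy loss sketch: the discrepancy between predicted class
# probabilities and the pre-assigned labels of a batch of examples.
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    # probs: (batch, classes) predicted probabilities; labels: (batch,) true class ids
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))   # small, because the model is mostly right
```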

The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[16]

The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

An artificial neural network (ANN), usually called a 'neural network' (NN), is a learning algorithm that is vaguely inspired by biological neural networks.

They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.

Falling hardware prices and the development of GPUs for personal use in the last few years have contributed to the development of the concept of deep learning which consists of multiple hidden layers in an artificial neural network.
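
A minimal sketch of such a multi-hidden-layer network, written against the Keras API mentioned in the first article (the layer sizes, input dimension, and optimizer are arbitrary placeholders, not a recommendation):

```python
# Minimal "deep" network sketch in Keras: several hidden layers stacked between
# input and output. Sizes and dataset are placeholders for illustration only.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                      # e.g. flattened 28x28 images
    keras.layers.Dense(256, activation="relu"),     # hidden layer 1
    keras.layers.Dense(128, activation="relu"),     # hidden layer 2
    keras.layers.Dense(64, activation="relu"),      # hidden layer 3
    keras.layers.Dense(10, activation="softmax"),   # class probabilities
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```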

Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
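
A brief scikit-learn sketch of that workflow on synthetic two-class data (the dataset and kernel choice are purely illustrative):

```python
# SVM sketch: fit a model on labeled two-class examples, then predict the
# category of new points. Synthetic data, for illustration only.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)      # build the model from training examples
print(clf.predict(X[:5]))              # assign new examples to one of the two categories
```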

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar.

Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters.
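
As one concrete instance, k-means assumes roughly spherical clusters around centroids and uses Euclidean distance as its similarity metric; a short scikit-learn sketch on synthetic data:

```python
# k-means clustering sketch: assign observations to clusters so that points in
# the same cluster are close to a shared centroid.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # synthetic observations
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])     # cluster assignment of the first few observations
print(km.inertia_)         # internal compactness: sum of squared distances to centroids
```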

A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG).
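
A tiny worked sketch of the idea, using the classic rain/sprinkler/wet-grass example with hand-picked, purely illustrative probabilities: the joint distribution factorizes along the DAG, which lets us answer queries such as P(Rain | WetGrass).

```python
# Toy Bayesian network: Rain -> Sprinkler, and (Rain, Sprinkler) -> WetGrass.
# Probabilities below are invented for illustration only.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},    # P(Sprinkler | Rain)
               False: {True: 0.40, False: 0.60}}
p_wet = {(True, True): 0.99, (True, False): 0.80,  # P(WetGrass=True | Rain, Sprinkler)
         (False, True): 0.90, (False, False): 0.00}

def joint(rain, sprinkler, wet):
    # The joint probability factorizes along the edges of the DAG.
    pw = p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[rain][sprinkler] * (pw if wet else 1 - pw)

# P(Rain | WetGrass=True), obtained by summing the joint over the hidden variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)   # roughly 0.36 with these illustrative numbers
```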

Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing reconstruction of the inputs coming from the unknown data generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.
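
One common instance of this idea is an autoencoder, which learns a compact code from which the input can be approximately reconstructed; a minimal Keras sketch (arbitrary sizes, training call left as a comment):

```python
# Autoencoder sketch: learn a compact representation that preserves enough
# information to reconstruct the input. Sizes are arbitrary placeholders.
from tensorflow import keras

inputs = keras.Input(shape=(784,))
code = keras.layers.Dense(32, activation="relu")(inputs)        # learned representation
outputs = keras.layers.Dense(784, activation="sigmoid")(code)   # reconstruction of the input
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, ...)   # trained to reproduce its own inputs
```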

Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and uses methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem.
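
A toy sketch of those ingredients (selection, crossover, mutation), evolving bit-strings toward a made-up "all ones" fitness function; every parameter here is arbitrary:

```python
# Toy genetic algorithm: evolve bit-strings toward all-ones ("OneMax").
import random

random.seed(0)
GENES, POP, GENERATIONS = 20, 30, 40

def fitness(genotype):            # more ones = fitter genotype
    return sum(genotype)

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: POP // 2]     # selection: keep the fitter half
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENES)
        child = a[:cut] + b[cut:]          # crossover: splice two parents
        if random.random() < 0.1:          # mutation: occasionally flip a bit
            i = random.randrange(GENES)
            child[i] ^= 1
        children.append(child)
    pop = parents + children

print(fitness(max(pop, key=fitness)), "out of", GENES)
```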

In 2006, the online movie company Netflix held the first 'Netflix Prize' competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.

Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ('everything is a recommendation') and they changed their recommendation engine accordingly.[38]

Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[45]

Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into a training set and a test set (conventionally a 2/3 training and 1/3 test designation) and evaluates the performance of the trained model on the test set.

In comparison, the k-fold cross-validation method randomly partitions the data into k subsets; k-1 of the subsets are used to train the model while the remaining subset is used to test its predictive ability, and the procedure is repeated so that each subset serves once as the test set.
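
A brief scikit-learn sketch of both schemes on synthetic data (the model and split sizes are illustrative only):

```python
# Holdout and k-fold validation sketch on synthetic classification data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Holdout: conventional 2/3 train, 1/3 test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))

# k-fold: each of k subsets takes a turn as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", scores)
```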

For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[62][63]

There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these 'greed' biases are addressed.[65]

Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences

Some of the smartest linguists in the world labored for years during the Cold War to create translation systems as a way to interpret Russian communications more easily.

The way we speak English is more influenced by who invaded whom hundreds of years ago than it is by someone sitting down and defining grammar rules.

After the failure of rule-based systems, new translation approaches were developed using models based on probability and statistics instead of grammar rules.

In the same way that the Rosetta Stone was used by scientists in the 1800s to figure out Egyptian hieroglyphs from Greek, computers can use parallel corpora to guess how to convert text from one language to another.

Here’s how it works: First, we break up our sentence into simple chunks that can each be easily translated. Next, we translate each of these chunks by finding all the ways humans have translated those same chunks of words in our training data.

For example, it’s much more common for someone to say “Quiero” to mean “I want” than to mean “I try.” So we can use how frequently “Quiero” was translated to “I want” in our training data to give that translation more weight than a less frequent translation.

Just from the chunk translations we listed in Step 2, we can already generate nearly 2,500 different variations of our sentence by combining the chunks in different ways.

But in a real-world system, there will be even more possible chunk combinations because we’ll also try different orderings of words and different ways of chunking the sentence. Now we need to scan through all of these generated sentences to find the one that sounds the “most human.” To do this, we compare each generated sentence to millions of real sentences from books and news stories written in English.
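
To make the scoring idea concrete, here is a toy sketch with a hypothetical three-chunk phrase table (the counts are invented; a real system would be vastly larger and would also score candidates against a language model):

```python
# Toy statistical-MT sketch: translate chunks using frequency-weighted options
# from a hypothetical phrase table, then keep the highest-scoring combination.
from itertools import product

phrase_table = {                         # invented counts of human translations per chunk
    "quiero": {"I want": 850, "I try": 150},
    "ir a":   {"to go to": 700, "go to": 300},
    "la playa": {"the beach": 950, "the seaside": 50},
}

def options(chunk):
    total = sum(phrase_table[chunk].values())
    return [(t, n / total) for t, n in phrase_table[chunk].items()]

sentence = ["quiero", "ir a", "la playa"]
candidates = []
for combo in product(*(options(c) for c in sentence)):
    text = " ".join(t for t, _ in combo)
    score = 1.0
    for _, p in combo:
        score *= p                       # weight frequent translations more heavily
    candidates.append((score, text))

print(max(candidates))                   # most probable combination of chunk translations
```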

In the early days, it was surprising to everyone that the “dumb” approach to translating based on probability worked better than rule-based systems designed by linguists.

If you ask Google to translate Georgian to Telugu, it has to internally translate into English as an intermediate step, because there aren’t enough Georgian-to-Telugu translations happening to justify investing heavily in that language pair.

The holy grail of machine translation is a black box system that learns how to translate by itself— just by looking at training data.

A regular (non-recurrent) neural network is a generic machine learning algorithm that takes in a list of numbers and calculates a result (based on previous training).

For example, we can use a neural network to calculate the approximate value of a house based on attributes of that house. But like most machine learning algorithms, neural networks are stateless.

A recurrent neural network (or RNN for short) is a slightly tweaked version of a neural network where the previous state of the neural network is one of the inputs to the next calculation.

You’re probably already familiar with this idea from watching any primetime detective show like CSI. The idea of turning a face into a list of measurements is an example of an encoding.

We can come up with an encoding that represents every possible different sentence as a series of unique numbers. To generate this encoding, we’ll feed the sentence into the RNN, one word at a time.

The final result after the last word is processed will be the values that represent the entire sentence. Great, so now we have a way to represent an entire sentence as a set of unique numbers!
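
A bare-bones numpy sketch of that encoder (random weights and made-up word vectors, no training), just to show how the final hidden state ends up summarizing the whole sentence:

```python
# Minimal RNN-encoder sketch: feed word vectors in one at a time; the hidden
# state after the last word is the "encoding" of the whole sentence.
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(embed_size, hidden_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights

def encode(word_vectors):
    h = np.zeros(hidden_size)                  # previous state starts at zero
    for x in word_vectors:
        h = np.tanh(x @ W_xh + h @ W_hh)       # previous state feeds the next step
    return h                                   # final state = sentence encoding

sentence = rng.normal(size=(5, embed_size))    # 5 hypothetical word embeddings
print(encode(sentence))
```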

Difference between Machine Learning & Statistical Modeling

One of the most common questions asked at various data science forums is: What is the difference between Machine Learning and Statistical Modeling?

When I first came across this question, I found almost no clear answer that laid out how machine learning differs from statistical modeling.

Given the similarity in the objective both try to solve for, the differences lie mainly in the volume of data involved and the degree of human involvement in building a model.

Here is an interesting Venn diagram on the coverage of machine learning and statistical modeling in the universe of data science (Reference: SAS institute)

The common objective behind using either of the tools is Learning from Data. Both these approaches aim to learn about the underlying phenomena by using data generated in the process.

Let us now see an interesting example published by McKinsey differentiating the two approaches.

Case: understand the risk level of customer churn over a period of time for a telecom company.

Data available: two drivers –

Even on a laptop with 16 GB of RAM, I routinely work on datasets of millions of rows with thousands of parameters and build an entire model in no more than 30 minutes.

Given the difference in the flavor of output from these two approaches, let us understand the difference between the two paradigms, even though both do an almost similar job. All the differences mentioned above separate the two to some extent, but there is no hard boundary between machine learning and statistical modeling.

Machine learning is a subfield of computer science and artificial intelligence that deals with building systems that can learn from data, instead of following explicitly programmed instructions.

It came into existence in the 1990s as steady advances in digitization and cheap computing power enabled data scientists to stop building finished models and instead train computers to do so.

The unmanageable volume and complexity of the big data that the world is now swimming in have increased the potential of machine learning—and the need for it.

In a statistical model, we basically try to estimate the function f in

Dependent variable (Y) = f(independent variables X) + error

Machine learning takes away the deterministic function f from the equation. It simply becomes: predict the output Y from the input X. It will try to find pockets of X in n dimensions (where n is the number of attributes) where the occurrence of Y is significantly different.
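
One hedged way to picture this "finding pockets of X" is a decision tree, which partitions the attribute space into regions where the distribution of Y differs; a small scikit-learn sketch on synthetic data:

```python
# Decision-tree sketch: partition the attribute space into "pockets" of X where
# the occurrence of Y is noticeably different. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # each leaf is a pocket of X with a different mix of Y
```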

Machine Learning

Supervised learning algorithms are trained using labeled examples, such as an input where the desired output is known.

The learning algorithm receives a set of inputs along with the corresponding correct outputs, and the algorithm learns by comparing its actual output with correct outputs to find errors.

Through methods like classification, regression, prediction and gradient boosting, supervised learning uses patterns to predict the values of the label on additional unlabeled data.
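
A compact scikit-learn sketch of that loop on synthetic data (the model and dataset are placeholders): learn from labeled examples, check the errors against the known correct outputs, then predict labels for new data.

```python
# Supervised-learning sketch: train on labeled examples, compare predictions to
# the known correct outputs, then predict labels for unlabeled data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X[:150], y[:150])             # learn from labeled inputs/outputs
print(mean_squared_error(y[:150], model.predict(X[:150])))   # errors vs. the correct outputs
print(model.predict(X[150:155]))                             # labels for new, "unlabeled" data
```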

Popular techniques include self-organizing maps, nearest-neighbor mapping, k-means clustering and singular value decomposition.
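
As one example from that list, singular value decomposition can reduce unlabeled data to a few directions that capture most of its variation; a short numpy sketch:

```python
# SVD sketch: project data onto its top-2 singular directions, an unsupervised
# way to summarize structure without any labels.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X -= X.mean(axis=0)                        # center the data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:2].T                   # project onto the top-2 components
print(X_reduced.shape)                     # (100, 2)
print(S[:2] / S.sum())                     # relative size of the top two singular values
```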

Difference between Classification and Regression - Georgia Tech - Machine Learning

Essential Tools for Machine Learning - MATLAB Video

Mathematics of Machine Learning

Do you need to know math to do machine learning? Yes! The big 4 math disciplines that make up machine learning are linear algebra, probability theory, ...

The Learning Problem :: What Is Machine Learning @ Machine Learning Foundations (機器學習基石)

Machine Learning Made Easy

Machine Learning - Supervised VS Unsupervised Learning

UW - MSR Machine Learning workshop 2015 - Session 2

11:00 Sure Screening for Gaussian Graphical Models - Daniela Witten. In an undirected graphical model, the nodes represent random variables, and an edge ...

Machine Learning: Inference for High-Dimensional Regression

At the Becker Friedman Institute's machine learning conference, Larry Wasserman of Carnegie Mellon University discusses the differences between machine ...

Machine Learning and Machine Translation

Google Tech Talks October 22, 2008 ABSTRACT In this talk I'll outline our work at the University of Edinburgh to model machine translation (MT) as a ...