# AI News, Model evaluation, model selection, and algorithm selection in machine learning

## Model evaluation, model selection, and algorithm selection in machine learning

Machine learning has become a central part of our life – as consumers, customers, and hopefully as researchers and practitioners!

Maybe we should address the previous question from another angle: “Why do we care about performance estimates at all?” Ideally, the estimated performance of a model tells how well it performs on unseen data – making predictions on future data is often the main problem we want to solve in applications of machine learning or the development of novel algorithms.

We evaluate the predictive performance of a model for three related sub-tasks: model evaluation, model selection, and algorithm selection. Although all three have in common that we want to estimate the performance of a model, each requires a different approach.

However, if there’s one key take-away message from this article, it is that biased performance estimates are perfectly okay in model selection and algorithm selection if the bias affects all models equally.

### 0-1 loss and prediction accuracy

In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples.

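Or in more formal terms, we define the prediction accuracy ACC as

$$\text{ACC} = 1 - \text{ERR},$$

where the prediction error ERR is computed as the expected value of the 0-1 loss over $n$ samples in a dataset $S$:

$$\text{ERR} = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y_i}, y_i).$$

The 0-1 loss $L(\cdot)$ is defined as

$$L(\hat{y_i}, y_i) = \begin{cases} 0 & \text{if } \hat{y_i} = y_i \\ 1 & \text{if } \hat{y_i} \neq y_i \end{cases}$$

where $y_i$ is the $i$th true class label and $\hat{y_i}$ the $i$th predicted class label, respectively.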
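As a minimal sketch of these definitions, the following Python snippet computes the 0-1 loss, the prediction error, and the prediction accuracy for two label lists. The function names and the toy labels are illustrative assumptions, not part of the original article.

```python
def zero_one_loss(y_pred, y_true):
    """Return 0 for a correct prediction and 1 for a wrong one."""
    return 0 if y_pred == y_true else 1


def prediction_error(predicted, true):
    """Average 0-1 loss over all n samples (ERR)."""
    if len(predicted) != len(true) or not true:
        raise ValueError("need two non-empty label lists of equal length")
    total_loss = sum(zero_one_loss(p, t) for p, t in zip(predicted, true))
    return total_loss / len(true)


def prediction_accuracy(predicted, true):
    """ACC = 1 - ERR."""
    return 1.0 - prediction_error(predicted, true)


# Toy example (made-up labels): one of five predictions is wrong.
true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]
print(prediction_error(predicted_labels, true_labels))
print(prediction_accuracy(predicted_labels, true_labels))
```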

Such a model maximizes the prediction accuracy or, vice versa, minimizes the probability, $C(h)$, of making a wrong prediction:

$$C(h) = \Pr_{(\mathbf{x}, y) \sim D} \big[ h(\mathbf{x}) \neq y \big],$$

where $D$ is the generating distribution our data has been drawn from and $\mathbf{x}$ is the feature vector of a sample with class label $y$.

Lastly, since we will mostly refer to the prediction accuracy (instead of the error) throughout this series of articles, we will use Dirac’s Delta function so that $\delta\big( L(\hat{y_i}, y_i)\big) = 1$ if $\hat{y_i} = y_i$ and $\delta\big( L(\hat{y_i}, y_i)\big) = 0$ if $\hat{y_i} \neq y_i$.

### Variance

The variance is simply the statistical variance of the estimator $\hat{\beta}$ around its expected value $E[\hat{\beta}]$:

$$\text{Var}(\hat{\beta}) = E\Big[ \big(\hat{\beta} - E[\hat{\beta}]\big)^2 \Big].$$

The variance is a measure of the variability of our model’s predictions if we repeat the learning process multiple times with small fluctuations in the training set.

(On a side note, we can estimate this so-called optimism bias as the difference between the training accuracy and the test accuracy.)

Typically, the splitting of a dataset into training and test sets is a simple process of random subsampling.

In the Iris dataset, the flower species are distributed uniformly, with 50 flowers in each of the three classes. If our random function assigns 2/3 of the flowers (100) to the training set and 1/3 of the flowers (50) to the test set, the class proportions in the two subsets will generally differ from the original ones. Assuming that the Iris dataset is representative of the true population (for instance, assuming that flowers are distributed uniformly in nature), we would then have created two imbalanced datasets with non-uniform class distributions.

Moreover, stratified sampling is incredibly easy to implement, and Ron Kohavi provides empirical evidence (Kohavi 1995) that stratification has a positive effect on the variance and bias of the estimate in k-fold cross-validation, a technique we will discuss later in this article.
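To illustrate how little code stratification needs, here is a minimal sketch using only the Python standard library. It shuffles the indices of each class separately and sends the same fraction of every class to the test set. The function name, the seed, and the per-class rounding rule are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Split sample indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        # Round per class, so the overall test size may differ slightly
        # from test_fraction * len(labels).
        n_test = round(len(class_indices) * test_fraction)
        test_indices.extend(class_indices[:n_test])
        train_indices.extend(class_indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Iris-like label vector: 50 samples for each of the three species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
train_indices, test_indices = stratified_split(labels)
print(Counter(labels[i] for i in train_indices))
print(Counter(labels[i] for i in test_indices))
```

Because the rounding happens per class, this sketch produces a 99/51 split rather than exactly 100/50, but every species contributes the same number of samples to each subset.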

Setting test data aside is our work-around for dealing with the imperfections of a non-ideal world, such as limited data and resources, and the inability to collect more data from the generating distribution.

Since hyperparameters are not learned during model fitting, we need some sort of “extra procedure” or “external loop” to optimize them separately – this holdout approach is ill-suited for the task.

Assuming that the algorithm could learn a better model from more data, we withheld valuable data that we set aside for estimating the generalization performance (i.e., the test dataset).

Certainly, a confidence interval around this estimate would not only be more informative and desirable in certain applications, but our point estimate could be quite sensitive to the particular training/test split (i.e., suffering from high variance).

Another, “simpler,” approach, which is often used in practice (although, I do not recommend it), may be using the familiar equation assuming a Normal Distribution to compute the confidence interval on the mean on a single training-test split under the central limit theorem.

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.

So, we could now consider each prediction as a Bernoulli trial, so that the number of correct predictions $X$ follows a binomial distribution $X \sim B(n, p)$ with $n$ trials and $k$ successes, where $n \in \mathbb{N}$ and $p \in [0,1]$:

$$f(k; n, p) = \Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

for $k = 0, 1, 2, ..., n$, where

$$\binom{n}{k} = \frac{n!}{k!(n - k)!}.$$

(Remember, $p$ is the probability of success, and $(1-p)$ is the probability of failure, that is, of a wrong prediction.)

Now, the expected number of successes is computed as $\mu = np$, or more concretely, if the estimator has a 50% success rate, we expect 20 out of 40 predictions to be correct.

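The estimate has a variance of $\sigma^2 = np(1-p) = 10$ and a standard deviation of

$$\sigma = \sqrt{np(1-p)} \approx 3.16.$$

Since we are interested in the average number of successes, not its absolute value, we compute the variance of the accuracy estimate as

$$\sigma^2 = \frac{1}{n} \text{ACC} (1 - \text{ACC})$$

and the respective standard deviation as

$$\sigma = \sqrt{\frac{1}{n} \text{ACC} (1 - \text{ACC})}.$$

Under the normal approximation, we can then compute the confidence interval as

$$\text{ACC} \pm z \sqrt{\frac{1}{n} \text{ACC} (1 - \text{ACC})},$$

where $\alpha$ is the error quantile and $z$ is the $1 - \frac{\alpha}{2}$ quantile of a standard normal distribution.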
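The normal-approximation interval can be computed in a few lines. The sketch below is one possible implementation using Python's standard library (`statistics.NormalDist` requires Python 3.8 or later). The function name and the clipping to the range [0, 1] are assumptions added for illustration, and the example reuses the 50% accuracy on 40 test samples from the text.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_confidence_interval(accuracy, n_samples, confidence=0.95):
    """Normal-approximation confidence interval for a test-set accuracy."""
    alpha = 1.0 - confidence
    # z is the (1 - alpha/2) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard_error = sqrt(accuracy * (1.0 - accuracy) / n_samples)
    lower = accuracy - z * standard_error
    upper = accuracy + z * standard_error
    # Clip to the valid range of an accuracy (an added convenience).
    return max(0.0, lower), min(1.0, upper)


lower, upper = accuracy_confidence_interval(accuracy=0.5, n_samples=40)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

With only 40 test samples the interval is roughly 0.5 ± 0.155, which shows how uncertain a point estimate from a small test set can be.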

## What is the Difference Between Test and Validation Datasets?

A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model’s hyperparameters.

The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.

In this section, we will take a look at how the train, test, and validation datasets are defined and how they differ according to some of the top machine learning texts and references.

It involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set.

In this example, they are clear to point out that the final model evaluation must be performed on a held out dataset that has not been used prior, either for training the model or tuning the model parameters.

The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to qualify performance.

The way to avoid this is to really hold the test set out—lock it away until you are completely done with learning and simply wish to obtain an independent evaluation of the final hypothesis.

Stuart Russell and Peter Norvig, page 709, Artificial Intelligence: A Modern Approach, 2009 (3rd edition)

Importantly, Russell and Norvig comment that the training dataset used to fit the model can be further split into a training set and a validation set, and that it is this subset of the training dataset, called the validation set, that can be used to get an early estimate of the skill of the model.

If the test set is locked away, but you still want to measure performance on unseen data as a way of selecting a good hypothesis, then divide the available data (without the test set) into a training set and a validation set.
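A minimal sketch of this three-way division, using only the Python standard library, might look as follows. The function name, the 60/20/20 proportions, and the seed are assumptions chosen for illustration. The test indices are carved off first so that they can be "locked away" before any model selection takes place.

```python
import random


def train_validation_test_split(n_samples, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into three disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = int(n_samples * test_fraction)
    n_validation = int(n_samples * validation_fraction)

    # The test set is set aside first and not touched during tuning.
    test_indices = indices[:n_test]
    validation_indices = indices[n_test:n_test + n_validation]
    train_indices = indices[n_test + n_validation:]
    return train_indices, validation_indices, test_indices


train_idx, validation_idx, test_idx = train_validation_test_split(100)
print(len(train_idx), len(validation_idx), len(test_idx))
```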

Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.

The crucial point is that a test set, by the standard definition in the NN [neural net] literature, is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error (assuming that the test set is representative of the population, etc.).

There are other ways of calculating an unbiased (or, in the case of the validation dataset, progressively more biased) estimate of model skill on unseen data.

Max Kuhn and Kjell Johnson, Page 78, Applied Predictive Modeling, 2013

They go on to recommend 10-fold cross-validation in general for small sample sizes because of the desirable low bias and variance properties of the performance estimate.

## Model Selection and Train/Validation/Test Sets

In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.

More importantly, you'll not only learn about the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.

The course will also draw from numerous case studies and applications, so that you'll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

## Cross-validation (statistics)

Cross-validation, sometimes called rotation estimation,[1][2][3] or out-of-sample testing is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.

In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set).[4] The goal of cross-validation is to test the model’s ability to predict new data that were not used in estimating it, in order to flag problems like overfitting and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).

If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data.

If we use least squares to fit a function in the form of a hyperplane $y = a + \boldsymbol{\beta}^T \mathbf{x}$ to the data $(\mathbf{x}_i, y_i)_{1 \le i \le n}$, we could then assess the fit using the mean squared error (MSE).

The MSE for given estimated parameter values $a$ and $\boldsymbol{\beta}$ on the training set $(\mathbf{x}_i, y_i)_{1 \le i \le n}$ is

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - a - \boldsymbol{\beta}^T \mathbf{x}_i\big)^2.$$

If the model is correctly specified, it can be shown under mild assumptions that the expected value of the MSE for the training set is $(n - p - 1)/(n + p + 1) < 1$ times the expected value of the MSE for the validation set[6] (the expected value is taken over the distribution of training sets), where $p$ is the number of covariates in $\mathbf{x}$.

Since in linear regression it is possible to directly compute the factor $(n - p - 1)/(n + p + 1)$ by which the training MSE underestimates the validation MSE under the assumption that the model specification is valid, cross-validation can be used for checking whether the model has been overfitted, in which case the MSE in the validation set will substantially exceed its anticipated value.

Pseudo-code algorithm:

    Input:
      x, {vector of length N with x-values of data points}
      y, {vector of length N with y-values of data points}
    Output:
      err, {estimate for the prediction error}
    Steps:
      err ← 0
      for i ← 1, ..., N do
        // define the cross-validation subsets
        x_in ← (x[1], ..., x[i − 1], x[i + 1], ..., x[N])
        y_in ← (y[1], ..., y[i − 1], y[i + 1], ..., y[N])
        x_out ← x[i]
        interpolate(x_in, y_in, x_out, y_out)
        err ← err + (y[i] − y_out)^2
      end for
      err ← err/N

Non-exhaustive cross validation methods do not compute all ways of splitting the original sample.
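The leave-one-out pseudo-code above can be turned into a short runnable Python sketch. The source does not specify what `interpolate` does, so this example assumes a simple least-squares line fit, and unlike the pseudo-code it returns `y_out` instead of passing it as an output argument. The data values are made up for illustration.

```python
def interpolate(x_in, y_in, x_out):
    """Fit a least-squares line to (x_in, y_in) and predict at x_out.

    The straight-line model is an assumption made for this sketch.
    """
    n = len(x_in)
    mean_x = sum(x_in) / n
    mean_y = sum(y_in) / n
    sxx = sum((x - mean_x) ** 2 for x in x_in)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_in, y_in))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x_out


def leave_one_out_error(x, y):
    """Mean squared prediction error estimated by leave-one-out CV."""
    err = 0.0
    for i in range(len(x)):
        # Define the cross-validation subsets: everything except point i.
        x_in = x[:i] + x[i + 1:]
        y_in = y[:i] + y[i + 1:]
        y_out = interpolate(x_in, y_in, x[i])
        err += (y[i] - y_out) ** 2
    return err / len(x)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.9, 5.2, 7.1, 8.8, 11.3]  # made-up, roughly linear data
print(leave_one_out_error(x, y))
```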

In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two).
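The shuffle-and-split step described above generalizes directly to k folds. The following sketch, which uses only the standard library, yields the training and validation indices for each round. With `k=2` it produces the two sets d0 and d1, each used once for training and once for validation. The function name and the seed are assumptions for illustration.

```python
import random


def k_fold_splits(n_samples, k=2, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation_indices = folds[i]
        training_indices = [
            index
            for j, fold in enumerate(folds)
            if j != i
            for index in fold
        ]
        yield training_indices, validation_indices


for training, validation in k_fold_splits(10, k=2):
    print("train:", sorted(training), "validate:", sorted(validation))
```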

In the holdout method, we randomly assign data points to two sets d0 and d1, usually called the training set and the test set, respectively.

While the holdout method can be framed as 'the simplest kind of cross-validation',[8] many sources instead classify holdout as a type of simple validation, rather than a simple or degenerate form of cross-validation.[9][10]

Repeated random sub-sampling validation, also known as Monte Carlo cross-validation,[11] randomly splits the dataset into training and validation data.

When the value being predicted is continuously distributed, the mean squared error, root mean squared error or median absolute deviation could be used to summarize the errors.

The variance of F* can be large.[13][14] For this reason, if two statistical procedures are compared based on the results of cross-validation, it is important to note that the procedure with the better estimated performance may not actually be the better of the two procedures.

In some cases such as least squares and kernel regression, cross-validation can be sped up significantly by pre-computing certain values that are needed repeatedly in the training, or by using fast 'updating rules' such as the Sherman–Morrison formula.

An extreme example of accelerating cross-validation occurs in linear regression, where the results of cross-validation have a closed-form expression known as the prediction residual error sum of squares (PRESS).
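As a sketch of this shortcut for the simplest case, a straight-line fit with one predictor, the snippet below computes PRESS from a single fit using the leverage of each point, and compares it with the brute-force sum of squared leave-one-out residuals obtained by refitting N times. It relies on the standard identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The function names and data are illustrative assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press_closed_form(x, y):
    """PRESS from a single fit: sum of (residual / (1 - leverage))^2."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    intercept, slope = fit_line(x, y)
    press = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (intercept + slope * xi)
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        press += (residual / (1.0 - leverage)) ** 2
    return press


def press_by_refitting(x, y):
    """Same quantity computed the slow way, refitting without each point."""
    press = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    return press


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.9, 5.2, 7.1, 8.8, 11.3]  # made-up data
print(press_closed_form(x, y), press_by_refitting(x, y))
```

Both functions should agree up to floating-point error, but the closed form needs one fit instead of N.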

For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population.

New evidence is that cross-validation by itself is not very predictive of external validity, whereas a form of experimental validation known as swap sampling that does control for human bias can be much more predictive of external validity.[15] As defined by this large MAQC-II study across 30,000 models, swap sampling incorporates cross-validation in the sense that predictions are tested across independent training and validation samples.

When there is a mismatch in these models developed across these swapped training and validation samples as happens quite frequently, MAQC-II shows that this will be much more predictive of poor external predictive validity than traditional cross-validation.

In addition to placing too much faith in predictions that may vary across modelers and lead to poor external validity due to these confounding modeler effects, cross-validation can be misused in other ways.

Since the order of the data is important, cross-validation might be problematic for time-series models.

For example, suppose we are interested in optical character recognition, and we are considering using either support vector machines (SVM) or k nearest neighbors (KNN) to predict the true character from an image of a handwritten character.

If we simply compared the methods based on their in-sample error rates, the KNN method would likely appear to perform better, since it is more flexible and hence more prone to overfitting compared to the SVM method.

It forms the basis of the validation statistic, Vn which is used to test the statistical validity of meta-analysis summary estimates.[19] It has also been used in a more conventional sense in meta-analysis to estimate the likely prediction error of meta-analysis results.[20]

## What is the difference between test set and validation set?

choose one option among several options), you must have an additional set/partition to gauge the accuracy of your choice so that you do not simply pick the most favorable result of randomness and mistake the tail-end of the distribution for the center.

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc).

But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.
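A small simulation can make this concrete. The sketch below assumes a deliberately artificial setup that is not from the original answer: 200 "models" that all guess at random, so each has a true accuracy of 50%. Picking the best one on a small validation set gives a flattering score, while a fresh test set reveals the real performance. All names, sizes, and the seed are illustrative assumptions.

```python
import random


def simulate_selection_bias(n_models=200, n_validation=50,
                            n_test=10000, seed=0):
    """Compare the winner's validation accuracy with its test accuracy."""
    rng = random.Random(seed)

    def random_guesser_accuracy(n_samples):
        # Each prediction is correct with probability 0.5, independently.
        correct = sum(rng.random() < 0.5 for _ in range(n_samples))
        return correct / n_samples

    # Score every model on the validation set and keep the best score.
    validation_scores = [
        random_guesser_accuracy(n_validation) for _ in range(n_models)
    ]
    best_validation_score = max(validation_scores)

    # The selected model is still a random guesser on unseen test data.
    test_score = random_guesser_accuracy(n_test)
    return best_validation_score, test_score


best_validation_score, test_score = simulate_selection_bias()
print("best validation accuracy:", best_validation_score)
print("test accuracy of the selected model:", test_score)
```

The gap between the two printed numbers is the "best possible scenario" versus the "most likely scenario": the validation score of the winner is inflated purely because it was the maximum of many noisy estimates.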
