
Model evaluation, model selection, and algorithm selection in machine learning

Machine learning has become a central part of our lives – as consumers, customers, and hopefully as researchers and practitioners!

Maybe we should address the previous question from another angle: “Why do we care about performance estimates at all?” Ideally, the estimated performance of a model tells us how well it performs on unseen data – making predictions on future data is often the main problem we want to solve in applications of machine learning or the development of novel algorithms.

Let us summarize the main points why we evaluate the predictive performance of a model:

1. We want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.
2. We want to increase the predictive performance by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space.
3. We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best-performing model from the algorithm’s hypothesis space.

Although these three sub-tasks have all in common that we want to estimate the performance of a model, they all require different approaches.

However, if there’s one key take-away message from this article, it is that biased performance estimates are perfectly okay in model selection and algorithm selection if the bias affects all models equally.

0-1 loss and prediction accuracy

In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples.

Or in more formal terms, we define the prediction accuracy ACC as

ACC = 1 - ERR_S,

where the prediction error ERR_S is computed as the expected value of the 0-1 loss over n samples in a dataset S:

ERR_S = \frac{1}{n} \sum_{i=1}^{n} L\big(\hat{y_i}, y_i\big).

The 0-1 loss L(\cdot) is defined as

L\big(\hat{y_i}, y_i\big) = \begin{cases} 0 & \text{if } \hat{y_i} = y_i \\ 1 & \text{if } \hat{y_i} \neq y_i, \end{cases}

where y_i is the ith true class label and \hat{y_i} the ith predicted class label, respectively.
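To make these definitions concrete, here is a minimal sketch (plain NumPy; the labels in y_true and y_pred are made-up example values) that computes the 0-1 loss, the prediction error, and the accuracy:

```python
import numpy as np

# Hypothetical true and predicted class labels for 10 samples
y_true = np.array([0, 1, 1, 0, 2, 1, 0, 2, 2, 1])
y_pred = np.array([0, 1, 0, 0, 2, 1, 1, 2, 2, 1])

# 0-1 loss per sample: 0 if the prediction is correct, 1 otherwise
zero_one_loss = (y_pred != y_true).astype(int)

# Prediction error ERR = mean 0-1 loss; accuracy ACC = 1 - ERR
err = zero_one_loss.mean()
acc = 1.0 - err
print(f"ERR = {err:.2f}, ACC = {acc:.2f}")  # ERR = 0.20, ACC = 0.80
```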

Such a model maximizes the prediction accuracy or, vice versa, minimizes the probability, C(h), of making a wrong prediction:

C(h) = \Pr_{(\mathbf{x}, y) \sim D} \big[ h(\mathbf{x}) \neq y \big],

where D is the generating distribution our data has been drawn from, \mathbf{x} is the feature vector of a sample with class label y.

Lastly, since we will mostly refer to the prediction accuracy (instead of the error) throughout this series of articles, we will use Dirac’s Delta function so that \delta\big( L(\hat{y_i}, y_i)\big) = 1 if \hat{y_i} = y_i and \delta\big( L(\hat{y_i}, y_i)\big) = 0 if \hat{y_i} \neq y_i.

Variance

The variance is simply the statistical variance of the estimator \hat{\beta} around its expected value E[\hat{\beta}],

\text{Variance} = E\Big[ \big( \hat{\beta} - E[\hat{\beta}] \big)^2 \Big].

The variance is a measure of the variability of our model’s predictions if we repeat the learning process multiple times with small fluctuations in the training set.
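As a rough illustration of this variability (a sketch, not part of the original article; the Iris data and the decision tree are only convenient stand-ins), we can refit the same model on repeated random training/test splits of one dataset and look at the spread of the resulting accuracy estimates:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(20):
    # Small fluctuations in the training set via different random splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

print(f"mean accuracy = {np.mean(accuracies):.3f}, "
      f"variance = {np.var(accuracies):.5f}")
```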

(On a side note, we can estimate this so-called optimism bias as the difference between the training accuracy and the test accuracy.) Typically, the splitting of a dataset into training and test sets is a simple process of random subsampling.

Consider the Iris dataset, in which the flower species are distributed uniformly. If our random function assigns 2/3 of the flowers (100) to the training set and 1/3 of the flowers (50) to the test set, the class proportions in the two subsets can easily deviate from the original uniform distribution. Assuming that the Iris dataset is representative of the true population (for instance, assuming that flowers are distributed uniformly in nature), we just created two imbalanced datasets with non-uniform class distributions.

Moreover, stratified sampling is incredibly easy to implement, and Ron Kohavi provides empirical evidence (Kohavi 1995) that stratification has a positive effect on the variance and bias of the estimate in k-fold cross-validation, a technique we will discuss later in this article.
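Stratification is indeed easy to get in common libraries; as a minimal sketch (assuming scikit-learn), the stratify argument of train_test_split preserves the class proportions in both subsets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 50 samples per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=123)

# Each class keeps roughly its 1/3 share in both subsets
print(np.bincount(y_train))  # approximately [33 33 34]
print(np.bincount(y_test))   # approximately [17 17 16]
```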

Setting test data aside is our work-around for dealing with the imperfections of a non-ideal world, such as limited data and resources, and the inability to collect more data from the generating distribution.

Since hyperparameters are not learned during model fitting, we need some sort of “extra procedure” or “external loop” to optimize them separately – this holdout approach is ill-suited for the task.
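One possible sketch of such an “external loop” is a three-way holdout: tune hyperparameters on a validation set and keep the test set untouched for the final estimate. The dataset, model, and candidate values below are illustrative assumptions, not the article’s prescribed method:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split into train (60%), validation (20%), and test (20%) sets
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# "External loop": pick the hyperparameter with the best validation accuracy
best_depth, best_acc = None, -1.0
for depth in (1, 2, 3, 4, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Refit on train + validation data, then estimate generalization performance once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(np.vstack([X_train, X_val]), np.hstack([y_train, y_val]))
print(f"best max_depth = {best_depth}, "
      f"test accuracy = {final_model.score(X_test, y_test):.3f}")
```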

Assuming that the algorithm could learn a better model from more data, we withheld valuable data that we set aside for estimating the generalization performance (i.e., the test dataset).

Certainly, a confidence interval around this estimate would not only be more informative and desirable in certain applications, but our point estimate could be quite sensitive to the particular training/test split (i.e., suffering from high variance).

Another, “simpler,” approach, which is often used in practice (although I do not recommend it), is to use the familiar normal-approximation equation to compute a confidence interval of the mean from a single training-test split, appealing to the central limit theorem.

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.

So, we could now consider each prediction as a Bernoulli trial, and the number of correct predictions X follows a binomial distribution X \sim B(n, p) over n samples (trials), where n \in \mathbb{N} \text{ and } p \in [0,1]:

f(k; n, p) = \Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

for k = 0, 1, 2, ..., n, where

\binom{n}{k} = \frac{n!}{k!(n-k)!}.

(Remember, p is the probability of success, and (1-p) is the probability of failure – a wrong prediction.) Now, the expected number of successes is computed as \mu = np, or more concretely, if the estimator has a 50% success rate, we expect 20 out of 40 predictions to be correct.

The estimate has a variance of

\sigma^2 = np(1-p) = 40 \times 0.5 \times 0.5 = 10

and a standard deviation of

\sigma = \sqrt{np(1-p)} = \sqrt{10} \approx 3.16.

Since we are interested in the average number of successes, not its absolute value, we compute the variance of the accuracy estimate as

\sigma^2_{ACC} = \frac{1}{n} ACC \, (1 - ACC)

and the respective standard deviation

\sigma_{ACC} = \sqrt{ \frac{1}{n} ACC \, (1 - ACC) }.

Under the normal approximation, we can then compute the confidence interval as

ACC \pm z \sqrt{ \frac{1}{n} ACC \, (1 - ACC) },

where \alpha is the error quantile and z is the 1 - \frac{\alpha}{2} quantile of a standard normal distribution.
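As a small numerical sketch of the formulas above (the example accuracy and the 95% z-value are illustrative assumptions):

```python
import math

n = 40      # number of test samples
acc = 0.5   # observed prediction accuracy (example value)
z = 1.96    # approx. 1 - alpha/2 quantile of the standard normal for alpha = 0.05

# Binomial mean and variance of the number of correct predictions
mu = n * acc                          # expected successes: 20
var_successes = n * acc * (1 - acc)   # 10

# Variance and standard deviation of the accuracy estimate
var_acc = acc * (1 - acc) / n
se_acc = math.sqrt(var_acc)

# Normal-approximation confidence interval
lower, upper = acc - z * se_acc, acc + z * se_acc
print(f"ACC = {acc:.2f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```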

Estimation, prediction, and evaluation of logistic regression models

I provide a practical introduction to using logistic regression for prediction (binary classification) using the Titanic data competition from Kaggle.com as an ...

Estimating and Validating Models

Estimate multiple models and validate against ...

Instrumental-variables regression using Stata®

Learn how to fit instrumental-variables models for endogenous covariates using -ivregress-. Created using Stata 13; largely applicable to Stata 14.

Random Projection Estimation of Discrete-Choice Models with Large Choice Sets

Matthew Shum of Caltech discusses the use of machine learning ideas in the estimation of discrete choice models, the workhorse model of demand in ...

Predicting Stock Prices - Learn Python for Data Science #4

In this video, we build an Apple Stock Prediction script in 40 lines of Python using the scikit-learn library and plot the graph using the matplotlib library.

Panel Data Models in Stata

Fixed Effects and Random Effects Models in Stata

Hypothesis testing and p-values | Inferential statistics | Probability and Statistics | Khan Academy

Hypothesis Testing and P-values Practice this yourself on Khan Academy right now: ...

Matching a Weibull Distribution to a Data Set in Excel

This video was created for Penn State's course AERSP 880: Wind Turbine Systems, by Susan Stewart and the Department of Aerospace Engineering ...

Chi-squared Test

Paul Andersen shows you how to calculate the chi-squared value to test your null hypothesis. He explains the importance of the critical value and defines the ...

Lecture 13 - Validation

Validation - Taking a peek out of sample. Model selection and data contamination. Cross validation. Lecture 13 of 18 of Caltech's Machine Learning Course - CS ...