# AI News, Model evaluation, model selection, and algorithm selection in machine learning

## Model evaluation, model selection, and algorithm selection in machine learning

In the previous article (Part I), we introduced the general ideas behind model evaluation in supervised machine learning.

Also, we briefly introduced the normal approximation, where we make certain assumptions that allow us to compute confidence intervals for modeling the uncertainty of our performance estimate based on a single test set, which we have to take with a grain of salt.

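To compute the classification error or accuracy on a dataset S, we defined the following equation:

$$\text{ERR}_S = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}_i, y_i) = 1 - \text{ACC}_S.$$

Here, L(\cdot) represents the 0-1 loss, which we compute from the predicted class labels \hat{y}_i and the true labels y_i over all n samples in a dataset S:

$$L(\hat{y}_i, y_i) = \begin{cases} 0 & \text{if } \hat{y}_i = y_i \\ 1 & \text{if } \hat{y}_i \neq y_i \end{cases}$$

In essence, the classification error is simply the count of incorrect predictions divided by the number of samples in the dataset.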

To use the resampling methods presented in the following sections with regression models, we simply need to replace the prediction accuracy or error with, for example, the mean squared error (MSE):

$$\text{MSE}_S = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$

As we learned in Part I, our performance estimates may suffer from bias and variance, and we are interested in finding a good trade-off.

Finally, even smaller subsets of the 3500-sample training set were produced via randomized, stratified splits, and I used these subsets to fit softmax classifiers, evaluating their performance on the same 1500-sample test set.

One way to obtain a more robust performance estimate that is less sensitive to how we split the data into training and test sets is to repeat the holdout method k times with different random seeds and compute the average performance over these k repetitions,

$$\text{ACC}_{\text{avg}} = \frac{1}{k} \sum_{j=1}^{k} \text{ACC}_j,$$

where \text{ACC}_j is the accuracy estimate of the jth test set of size m,

$$\text{ACC}_j = 1 - \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}_i, y_i).$$

This repeated holdout procedure, sometimes also called Monte Carlo Cross-Validation, provides us with a better estimate of how well our model may perform on a random test set, and it can also give us an idea about our model’s stability, that is, how the model produced by a learning algorithm changes with different training set splits.
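The repeated holdout procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the experiments in this article: the function names, the toy majority-class learner, and the data are all hypothetical, and the splits are random rather than stratified.

```python
import random
from statistics import mean, stdev

def repeated_holdout(X, y, fit, predict, k=50, test_fraction=0.3, seed=0):
    """Average test accuracy over k random (non-stratified) holdout splits."""
    rng = random.Random(seed)
    n = len(X)
    m = max(1, round(test_fraction * n))  # test set size m
    accuracies = []
    for _ in range(k):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random split in every repetition
        test_idx, train_idx = idx[:m], idx[m:]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / m)  # ACC_j
    return mean(accuracies), stdev(accuracies)

# Hypothetical toy learner: always predict the majority class of the training set.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = list(range(100))
y = [0] * 70 + [1] * 30
acc_avg, acc_std = repeated_holdout(X, y, fit_majority, predict_majority)
print(f"ACC_avg = {acc_avg:.3f} +/- {acc_std:.3f}")
```

The standard deviation across the k repetitions gives a rough sense of how much the estimate depends on the particular split.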

Second, we see a small increase in the pessimistic bias when we decrease the size of the training set — we withhold more training data in the 50/50 split, which may be the reason why the average performance over the 50 splits is slightly lower compared to the 90/10 splits.

As a side note, the term “bootstrap” likely originated from the phrase “to pull oneself up by one’s bootstraps:” Circa 1900, to pull (oneself) up by (one’s) bootstraps was used figuratively of an impossible task (Among the “practical questions” at the end of chapter one of Steele’s “Popular Physics” schoolbook (1888) is, “30.

By 1916 its meaning expanded to include “better oneself by rigorous, unaided effort.” The meaning “fixed sequence of instructions to load the operating system of a computer” (1953) is from the notion of the first-loaded program pulling itself, and the rest, up by the bootstrap.

In brief, the idea of the bootstrap method is to generate new data from a population by repeated sampling from the original dataset with replacement — in contrast, the repeated holdout method can be understood as sampling without replacement.

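Here, we pick our lower and upper confidence bounds as follows:

$$\text{ACC}_{\text{lower}} = \alpha_1\text{th percentile of the } \text{ACC}_{\text{boot}} \text{ distribution},$$

$$\text{ACC}_{\text{upper}} = \alpha_2\text{th percentile of the } \text{ACC}_{\text{boot}} \text{ distribution},$$

where \alpha_1 = \alpha and \alpha_2 = 1 - \alpha, and \alpha is our degree of confidence to compute the 100 \times (1 - 2 \times \alpha) confidence interval.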

For instance, to compute a 95% confidence interval, we pick \alpha = 0.025 to obtain the 2.5th and 97.5th percentiles of the distribution of the b bootstrap estimates as our lower and upper confidence bounds.
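As a rough illustration of the percentile method, the sketch below resamples a list of per-sample 0/1 correctness flags and reads off the 2.5th and 97.5th percentiles. This is a simplification under stated assumptions: the model is not refit in each round, the `percentile` helper uses linear interpolation, and the data are made up.

```python
import random

def percentile(sorted_values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def bootstrap_percentile_ci(correct, b=2000, alpha=0.025, seed=0):
    """correct: list of 0/1 flags (1 = the prediction was right)."""
    rng = random.Random(seed)
    n = len(correct)
    # One accuracy estimate per bootstrap sample, drawn with replacement.
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(b))
    return percentile(accs, 100 * alpha), percentile(accs, 100 * (1 - alpha))

correct = [1] * 85 + [0] * 15  # hypothetical: 85 of 100 predictions correct
lower, upper = bootstrap_percentile_ci(correct)
print(f"95% CI for accuracy: [{lower:.3f}, {upper:.3f}]")
```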

In the left subplot, I applied the Leave-One-Out Bootstrap technique to evaluate 3-nearest neighbors models on Iris, and the right subplot shows the results of the same model evaluation approach on MNIST, using the same softmax algorithm that we discussed earlier.

For instance, we can compute the probability that a given sample from a dataset of size n is not drawn as a bootstrap sample as

$$P(\text{not chosen}) = \left(1 - \frac{1}{n}\right)^n,$$

which is asymptotically equivalent to \frac{1}{e} \approx 0.368 as n \rightarrow \infty.

Vice versa, we can then compute the probability that a sample is chosen as

$$P(\text{chosen}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 0.632$$

for reasonably large datasets, so that we’d select approximately 0.632 \times n unique samples as bootstrap training sets and reserve 0.368 \times n out-of-bag samples for testing in each iteration.
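A quick numerical check makes these two probabilities concrete. The sketch evaluates (1 - 1/n)^n for a few dataset sizes and then simulates a single bootstrap draw to count the fraction of unique samples; the sizes and seed are arbitrary choices.

```python
import math
import random

def prob_not_drawn(n):
    """Probability that a given sample is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 100000):
    print(n, round(prob_not_drawn(n), 4), "limit 1/e =", round(1 / math.e, 4))

# Simulation: fraction of unique indices in one bootstrap sample of size n.
rng = random.Random(0)
n = 10000
unique_fraction = len({rng.randrange(n) for _ in range(n)}) / n
print("unique fraction:", unique_fraction)
```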

Now, to address the bias that is due to the sampling with replacement, Bradley Efron proposed the .632 Estimate that we mentioned earlier, which is computed via the following equation:

$$\text{ACC}_{\text{boot}} = \frac{1}{b} \sum_{i=1}^{b} \left(0.632 \cdot \text{ACC}_{h, i} + 0.368 \cdot \text{ACC}_{r, i}\right),$$

where \text{ACC}_{r, i} is the resubstitution accuracy, and \text{ACC}_{h, i} is the accuracy on the out-of-bag sample.
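The .632 estimate can be sketched as follows, using a tiny 1-nearest-neighbor classifier on made-up one-dimensional data. The helper names and the data are hypothetical, and the resubstitution accuracy is computed on the bootstrap training sample itself, which is one possible reading of \text{ACC}_{r, i}.

```python
import random

def nn_predict(train, x):
    """1-nearest-neighbor prediction; train is a list of (feature, label) pairs."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    return sum(nn_predict(train, x) == y for x, y in data) / len(data)

def bootstrap_632(data, b=100, seed=0):
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        train = [data[i] for i in idx]
        oob = [data[i] for i in range(n) if i not in chosen]  # out-of-bag samples
        if not oob:
            continue  # extremely unlikely; skip rounds without test data
        acc_r = accuracy(train, train)  # resubstitution accuracy
        acc_h = accuracy(train, oob)    # out-of-bag accuracy
        estimates.append(0.632 * acc_h + 0.368 * acc_r)
    return sum(estimates) / len(estimates)

# Hypothetical data: two overlapping Gaussian classes.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), 0) for _ in range(50)] + \
       [(data_rng.gauss(2, 1), 1) for _ in range(50)]
acc_632 = bootstrap_632(data)
print(f".632 estimate: {acc_632:.3f}")
```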

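Instead of using a fixed “weight” \omega = 0.632 in

$$\text{ACC}_{\text{boot}} = \frac{1}{b} \sum_{i=1}^{b} \left(\omega \cdot \text{ACC}_{h, i} + (1 - \omega) \cdot \text{ACC}_{r, i}\right),$$

we compute the weight \omega as

$$\omega = \frac{0.632}{1 - 0.368 \times R},$$

where R is the relative overfitting rate

$$R = \frac{(-1) \times (\text{ACC}_{h, i} - \text{ACC}_{r, i})}{\gamma - (1 - \text{ACC}_{r, i})}.$$

(Since we are plugging \omega into the equation for computing \text{ACC}_{\text{boot}} that we defined above, \text{ACC}_{h, i} and \text{ACC}_{r, i} still refer to the out-of-bag and resubstitution accuracy estimates in the ith bootstrap round, respectively.) Further, we need to determine the no-information rate \gamma in order to compute R.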

For instance, we can compute \gamma by fitting a model f to a dataset that contains all possible combinations between samples x_{i'} and target class labels y_{i}; that is, we pretend that the observations and class labels are independent:

$$\gamma = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} L(y_i, f(x_{i'})).$$

Alternatively, we can estimate the no-information rate \gamma as follows:

$$\gamma = \sum_{k=1}^{K} p_k (1 - q_k),$$

where p_k is the proportion of class k samples observed in the dataset, and q_k is the proportion of class k samples that the classifier predicts in the dataset.
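The class-proportion estimate of \gamma and the resulting .632+ weight can be sketched as below. Two assumptions are worth flagging: the relative overfitting rate is written in terms of error rates with the resubstitution error in the denominator, following Efron and Tibshirani's definition, and R is clipped to [0, 1], which is a common safeguard rather than something stated in the text. The function names and example labels are hypothetical.

```python
from collections import Counter

def no_information_rate(y_true, y_pred):
    """gamma = sum_k p_k * (1 - q_k), estimated from class proportions."""
    n = len(y_true)
    p = Counter(y_true)  # observed class counts
    q = Counter(y_pred)  # predicted class counts (missing classes count as 0)
    return sum((p[k] / n) * (1 - q[k] / n) for k in p)

def weight_632_plus(acc_r, acc_h, gamma):
    """omega = 0.632 / (1 - 0.368 * R), with R clipped to [0, 1] (assumption)."""
    err_r, err_h = 1 - acc_r, 1 - acc_h
    if gamma > err_r and err_h > err_r:
        r = min(1.0, (err_h - err_r) / (gamma - err_r))
    else:
        r = 0.0  # no measurable overfitting: fall back to the plain .632 weight
    return 0.632 / (1 - 0.368 * r)

y_true = [0] * 50 + [1] * 50
y_pred = [0] * 60 + [1] * 40
gamma = no_information_rate(y_true, y_pred)
omega = weight_632_plus(acc_r=1.0, acc_h=0.8, gamma=gamma)
print(f"gamma = {gamma:.3f}, omega = {omega:.3f}")
```

With no overfitting (R = 0) the weight reduces to 0.632, and with R = 1 it approaches 1, so that the estimate relies entirely on the out-of-bag accuracy.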

## Bootstrapping (statistics)

In statistics, bootstrapping is any test or metric that relies on random sampling with replacement.

Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods. Generally, it falls in the broader class of resampling methods.

In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset).

It is often used as an alternative to statistical inference based on the assumption of a parametric model when that assumption is in doubt, or where parametric inference is impossible or requires complicated formulas for the calculation of standard errors.

The bootstrap was published by Bradley Efron in 'Bootstrap methods: another look at the jackknife' (1979), inspired by earlier work on the jackknife. Improved estimates of the variance were developed later. A Bayesian extension was developed in 1981. The bias-corrected and accelerated (BCa) bootstrap was developed by Efron in 1987, and the ABC procedure in 1992. The basic idea of bootstrapping is that inference about a population from sample data, (sample → population), can be modelled by resampling the sample data and performing inference about a sample from resampled data, (resampled → sample).

More formally, the bootstrap works by treating inference of the true probability distribution J, given the original data, as being analogous to inference of the empirical distribution Ĵ, given the resampled data.

The simplest bootstrap method involves taking the original data set of N heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N.

The bootstrap sample is drawn from the original sample with replacement (e.g., we might 'resample' 5 times from [1,2,3,4,5] and get [2,5,4,4,1]), so, assuming N is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original 'real' sample.

This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples we compute its mean (each of these is called a bootstrap estimate).
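A minimal sketch of this procedure, assuming a made-up sample of 200 heights: it draws 1,000 resamples of size N with replacement, computes the mean of each, and compares the spread of the bootstrap estimates with the textbook standard error of the mean.

```python
import random
from statistics import stdev

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return one mean per bootstrap resample (each resample has size N)."""
    rng = random.Random(seed)
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples)]

# Hypothetical data: 200 heights in cm.
data_rng = random.Random(42)
heights = [data_rng.gauss(170, 10) for _ in range(200)]

estimates = bootstrap_means(heights)
se_boot = stdev(estimates)                           # bootstrap standard error
se_formula = stdev(heights) / len(heights) ** 0.5    # s / sqrt(N)
print(f"bootstrap SE = {se_boot:.3f}, formula SE = {se_formula:.3f}")
```

For the mean, the two numbers should agree closely; the appeal of the bootstrap is that the same loop works for statistics that have no simple standard-error formula.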

It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratio, and correlation coefficients.

Although for most problems it is impossible to know the true confidence interval, bootstrap is asymptotically more accurate than the standard intervals obtained using sample variance and assumptions of normality. Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees.

Moreover, there is evidence that numbers of samples greater than 100 lead to negligible improvements in the estimation of standard errors. In fact, according to the original developer of the bootstrapping method, even setting the number of samples at 50 is likely to lead to fairly good standard error estimates. Adèr et al. recommend the bootstrap procedure in a number of situations.

However, Athreya has shown that if one performs a naive bootstrap on the sample mean when the underlying population lacks a finite variance (for example, a power law distribution), then the bootstrap distribution will not converge to the same limit as the sample mean.

In univariate problems, it is usually acceptable to resample the individual observations with replacement ('case resampling' below) unlike subsampling, in which resampling is without replacement and is valid under much weaker conditions compared to the bootstrap.

The bootstrap comes in handy when there is no analytical form or normal theory to help estimate the distribution of the statistic of interest, since the bootstrap method can be applied to most random quantities, e.g., the ratio of variance and mean.


An example of the first resample might look like this X1* = x2, x1, x10, x10, x3, x4, x6, x7, x1, x9.

In regression problems, case resampling refers to the simple scheme of resampling individual cases - often rows of a data set.

In the Bayesian bootstrap, distributions of a parameter inferred from many reweighted data sets D are then interpretable as posterior distributions on that parameter.

In the smooth bootstrap, a small amount of (usually normally distributed) zero-centered random noise is added onto each resampled observation.

In the parametric bootstrap, a parametric model is fitted to the data, often by maximum likelihood, and samples of random numbers are drawn from this fitted model.

The use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.

Gaussian processes are methods from Bayesian non-parametric statistics but are here used to construct a parametric bootstrap approach, which implicitly allows the time-dependence of the data to be taken into account.

In the wild bootstrap, the idea is, as in the residual bootstrap, to leave the regressors at their sample values, but to resample the response variable based on the residual values.

This method assumes that the 'true' residual distribution is symmetric and can offer advantages over simple residual sampling for smaller sample sizes.

In the moving block bootstrap, introduced by Künsch (1989), data is split into n-b+1 overlapping blocks of length b: Observation 1 to b will be block 1, observation 2 to b+1 will be block 2 etc.
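The block construction can be sketched as follows. The split into n-b+1 overlapping blocks follows the description above; the resampling step (drawing blocks with replacement, concatenating them, and truncating to the original length) is the usual way such blocks are used and is stated here as an assumption.

```python
import random

def moving_blocks(series, b):
    """Split a series of length n into n - b + 1 overlapping blocks of length b."""
    n = len(series)
    return [series[i:i + b] for i in range(n - b + 1)]

def moving_block_bootstrap(series, b, seed=0):
    """Draw blocks with replacement and concatenate them to the original length."""
    rng = random.Random(seed)
    blocks = moving_blocks(series, b)
    n = len(series)
    resample = []
    while len(resample) < n:
        resample.extend(rng.choice(blocks))
    return resample[:n]  # truncate the last block if it overshoots

series = list(range(1, 11))  # observations 1..10
blocks = moving_blocks(series, b=3)
print(blocks[0], blocks[1], len(blocks))
resample = moving_block_bootstrap(series, b=3)
print(resample)
```

Because whole blocks are kept intact, short-range dependence between neighboring observations is preserved inside each block.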

Other related modifications of the moving block bootstrap are the Markovian bootstrap and a stationary bootstrap method that matches subsequent blocks based on standard deviation matching.

The structure of the block bootstrap is easily obtained (where the block just corresponds to the group), and usually only the groups are resampled, while the observations within the groups are left unchanged.

The bootstrap distribution of a point estimator of a population parameter has been used to produce a bootstrapped confidence interval for the parameter's true value, if the parameter can be written as a function of the population's distribution.

Popular families of point-estimators include mean-unbiased minimum-variance estimators, median-unbiased estimators, Bayesian estimators (for example, the posterior distribution's mode, median, mean), and maximum-likelihood estimators.

The bootstrapping of a maximum-likelihood estimator may often be improved using transformations related to pivotal quantities. The bootstrap distribution of a parameter-estimator has been used to calculate confidence intervals for its population-parameter. There are several methods for constructing confidence intervals from the bootstrap distribution of a real parameter: See Davison and Hinkley (1997, equ.

It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute value loss function).

In other cases, the percentile bootstrap can be too narrow. When working with small sample sizes (i.e., fewer than 50), the percentile confidence intervals for (for example) the variance statistic will be too narrow.

According to Rice, 'Although this direct equation of quantiles of the bootstrap sampling distribution with confidence limits may seem initially appealing, its rationale is somewhat obscure.' The studentized test enjoys optimal properties as the statistic that is bootstrapped is pivotal (i.e.

In 1878, Simon Newcomb took observations on the speed of light. The data set contains two outliers, which greatly influence the sample mean.

(Note that the sample mean need not be a consistent estimator for any population mean, because no mean need exist for a heavy-tailed distribution.) A well-defined and robust statistic for central tendency is the sample median, which is consistent and median-unbiased for the population median.

A convolution method of regularization reduces the discreteness of the bootstrap distribution by adding a small amount of N(0, \sigma^2) random noise to each bootstrap sample.

In this example, the bootstrapped 95% (percentile) confidence interval for the population median is (26, 28.5), which is close to the interval (25.98, 28.46) obtained with the smoothed bootstrap.

In situations where an obvious statistic can be devised to measure a required characteristic using only a small number, r, of data items, a corresponding statistic based on the entire sample can be formulated.

## Resampling (statistics)

In statistics, resampling covers a variety of methods that repeatedly draw samples from the observed data. Common resampling techniques include bootstrapping, jackknifing and permutation tests.

Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.

It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.

Bootstrapping techniques are also used in the updating-selection transitions of particle filters, genetic type algorithms and related resample/reconfiguration Monte Carlo methods used in computational physics and molecular chemistry. In this context, the bootstrap is used to replace sequentially empirical weighted probability measures by empirical measures.

Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic, when a random sample of observations is used to calculate it.

Historically this method preceded the invention of the bootstrap with Quenouille inventing this method in 1949 and Tukey extending it in 1958. This method was foreshadowed by Mahalanobis who in 1946 suggested repeated estimates of the statistic of interest with half the sample chosen at random. He coined the name 'interpenetrating samples' for this method.

Tukey extended this method by assuming that if the replicates could be considered identically and independently distributed, then an estimate of the variance of the sample parameter could be made and that it would be approximately distributed as a t variate with n−1 degrees of freedom (n being the sample size).

The basic idea behind the jackknife variance estimator lies in systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set.
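A delete-1 jackknife can be sketched in a few lines. The variance and bias formulas used here are the standard delete-1 expressions, which the text does not spell out; the function name and data are hypothetical.

```python
from statistics import mean, variance

def jackknife(sample, statistic):
    """Delete-1 jackknife estimates of the variance and bias of a statistic."""
    n = len(sample)
    # Recompute the statistic n times, leaving out one observation each time.
    replicates = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_dot = sum(replicates) / n
    var_jack = (n - 1) / n * sum((t - theta_dot) ** 2 for t in replicates)
    bias_jack = (n - 1) * (theta_dot - statistic(sample))
    return var_jack, bias_jack

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
var_jack, bias_jack = jackknife(data, mean)
print(var_jack, variance(data) / len(data), bias_jack)
```

For the sample mean, the jackknife variance coincides with the familiar s²/n and the estimated bias is zero, which makes the mean a convenient sanity check. The result is also deterministic: repeating the computation on the same data gives exactly the same numbers.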

The jackknife is consistent for the sample means, sample variances, central and non-central t-statistics (with possibly non-normal populations), sample coefficient of variation, maximum likelihood estimators, least squares estimators, correlation coefficients and regression coefficients.

In the case of a unimodal variate the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi square distribution with two degrees of freedom.

Although there are huge theoretical differences in their mathematical insights, the main practical difference for statistics users is that the bootstrap gives different results when repeated on the same data, whereas the jackknife gives exactly the same result each time.

On the other hand, when this verification feature is not crucial and it is of interest not to have a number but just an idea of its distribution, the bootstrap is preferred (e.g., studies in physics, economics, biological sciences).

However, the bootstrap variance estimator is not as good as the jackknife or the balanced repeated replication (BRR) variance estimator in terms of the empirical results.

The delete-1 jackknife should only be used with smooth, differentiable statistics (e.g., totals, means, proportions, ratios, odds ratios, regression coefficients, etc.), not with medians or quantiles.

More general jackknifes than the delete-1, such as the delete-m jackknife, overcome this problem for the medians and quantiles by relaxing the smoothness requirements for consistent variance estimation.

Complex sampling schemes may involve stratification, multiple stages (clustering), varying sampling weights (non-response adjustments, calibration, post-stratification) and unequal-probability sampling designs.

Theoretical aspects of both the bootstrap and the jackknife can be found in Shao and Tu (1995), whereas a basic introduction is given in Wolter (2007). The bootstrap estimate of model prediction bias is more precise than jackknife estimates with linear models such as linear discriminant function or multiple regression.

Subsampling is an alternative method for approximating the sampling distribution of an estimator.

In addition, the resample (or subsample) size must tend to infinity together with the sample size but at a smaller rate, so that their ratio converges to zero.

While subsampling was originally proposed for the case of independent and identically distributed (iid) data only, the methodology has been extended to cover time series data as well.

Such cases include, for example, those where the rate of convergence of the estimator is not the square root of the sample size or where the limiting distribution is non-normal.

For comparison, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is.

In contrast, the cross-validated mean-square error will tend to decrease if valuable predictors are added, but increase if worthless predictors are added.

A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.


The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis H_0.

Next, the difference in sample means is calculated and recorded for every possible way of dividing these pooled values into two groups of the same sizes as the original groups.

The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that group label does not matter.

If the only purpose of the test is to reject or not reject the null hypothesis, we can, as an alternative, sort the recorded differences and then observe whether the observed difference T(obs) is contained within the middle 95% of them.
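For small groups, the exact permutation distribution can be enumerated directly. The sketch below computes a two-sided p-value for the difference in means instead of checking the middle 95% of the sorted differences, which leads to the same decision at the 5% level; the function name and data are hypothetical.

```python
from itertools import combinations

def exact_permutation_test(a, b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = list(a) + list(b)
    n_a, n_total = len(a), len(a) + len(b)
    total = sum(pooled)
    observed = sum(a) / n_a - sum(b) / len(b)  # T(obs)
    diffs = []
    # Every way of assigning n_a of the pooled values to the first group.
    for idx in combinations(range(n_total), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diffs.append(sum_a / n_a - (total - sum_a) / (n_total - n_a))
    # Small tolerance guards against floating-point ties.
    extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in diffs)
    return observed, extreme / len(diffs)

observed, p_value = exact_permutation_test([1, 2], [3, 4])
print(observed, p_value)
```

With two values per group there are only six possible splits, so the smallest attainable two-sided p-value is 1/3; the number of splits grows combinatorially with the group sizes, which is why full enumeration is limited to small samples.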

In contrast to permutation tests, the reference distributions for many popular 'classical' statistical tests, such as the t-test, F-test, z-test, and χ² test, are obtained from theoretical probability distributions.

For small samples, the chi-square reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher's exact test becomes more appropriate.

Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square).

All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption.

For example, it is possible in this manner to construct a permutation t-test, a permutation χ² test of association, a permutation version of Aly's test for comparing variances and so on.

The major downside to permutation tests is that they can be computationally intensive.

Permutation tests exist for any test statistic, regardless of whether or not its distribution is known.

Permutation tests can be used for analyzing unbalanced designs and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001).

Permutation tests may be ideal for analyzing quantized data that do not satisfy statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA) (Collingridge, 2013).

Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations has made the application of permutation test methods practical for a wide range of problems.

It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based 'exact' confidence intervals.

Good (2005) explains the difference between permutation tests and bootstrap tests the following way: 'Permutations test hypotheses concerning distributions;

An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner.

This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates.
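A Monte Carlo version replaces the full enumeration with random shuffles of the pooled data. This is a minimal sketch: the function name and data are hypothetical, and the "+1" in numerator and denominator is a common convention that counts the observed arrangement itself, not something stated in the text.

```python
import random

def monte_carlo_permutation_test(a, b, n_permutations=10000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            count += 1
    # Assumed convention: include the observed arrangement in the count.
    return (count + 1) / (n_permutations + 1)

p_mc = monte_carlo_permutation_test([1, 2], [3, 4])
print(f"Monte Carlo p-value: {p_mc:.3f} (exact value is 1/3)")
```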

The earliest known reference to this approach is Dwass (1957). This type of permutation test is known under various names: approximate permutation test, Monte Carlo permutation tests or random permutation tests.

The question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee the conclusion drawn from the estimated p-value \hat{p}.
