# AI News, The Two Cultures: statistics vs. machine learning?

## The Two Cultures: statistics vs. machine learning?

If somebody claims a particular estimator is an unbiased estimator for $\theta$, then we try many values of $\theta$ in turn, generate many samples from each based on some assumed model, push them through the estimator, and find the average estimated $\theta$.

If we can prove that the expected estimate equals the true value, for all values, then we say it's unbiased.
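The checking procedure described above can be sketched as a small simulation. This is only an illustration under assumed choices that are not in the original: the parameter $\theta$ is taken to be the variance of a normal distribution, and the two candidate estimators (`variance_divide_by_n`, `variance_divide_by_n_minus_1`) and the helper `average_estimate` are hypothetical names. A simulation like this can suggest bias, but it is not a proof of unbiasedness.

```python
import random

def variance_divide_by_n(sample):
    # Candidate estimator 1: sum of squared deviations divided by n.
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

def variance_divide_by_n_minus_1(sample):
    # Candidate estimator 2: sum of squared deviations divided by n - 1.
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

def average_estimate(estimator, theta, n=10, trials=5000, seed=0):
    # Assumed model: n i.i.d. normal draws with mean 0 and variance theta.
    rng = random.Random(seed)
    sigma = theta ** 0.5
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += estimator(sample)
    return total / trials

# Try several true values of theta and compare the average estimate to each.
for theta in (0.5, 1.0, 4.0):
    for estimator in (variance_divide_by_n, variance_divide_by_n_minus_1):
        avg = average_estimate(estimator, theta)
        print(f"theta={theta:<4} {estimator.__name__:<30} average estimate={avg:.3f}")
```

Under these assumptions the divide-by-n estimator should average noticeably below the true value for every $\theta$ tried, while the divide-by-(n − 1) estimator should average close to it.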

The empirical data you use might have all sorts of problems with it, and might not behave according to the model we agreed upon for evaluation.

While your method might have worked on one dataset (the dataset with train and test data) that you used in your evaluation, I can prove that mine will always work.

Your 'proof' is only valid if the entire dataset behaves according to the model you assumed.

I'd love to step in and balance things up, perhaps demonstrating some other issues, but I really love watching my frequentist colleague squirm.

Whereas I will do an evaluation that is more general (because it involves a broadly-applicable proof) and also more limited (because I don't know if your dataset is actually drawn from the modelling assumptions I use while designing my evaluation).

ML: 'What evaluation do you use, B?'

Then we can use the idea that none of us care what's in the black box, we care only about different ways to evaluate.

The frequentist will calculate these for each blood testing method that's under consideration and then recommend that we use the test that got the best pair of scores.

They will want to know 'of those that get a Positive result, how many will get Sick?' and 'of those that get a Negative result, how many are Healthy?'

ML: 'Ah yes, that seems like a better pair of questions to ask.'

One option is to run the tests on lots of people and just observe the relevant proportions.
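Another way to answer the same pair of questions, sketched here under assumptions that are not in the original, is to combine a test's per-condition scores with an assumed base rate using Bayes' rule. The function name `predictive_values` and the numbers used (sensitivity 0.95, specificity 0.90, prevalence 0.01) are purely illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Turn P(Positive | Sick) and P(Negative | Healthy) into
    P(Sick | Positive) and P(Healthy | Negative) via Bayes' rule.

    `prevalence` is the assumed prior probability of being Sick.
    """
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_negative = 1 - p_positive
    p_sick_given_positive = sensitivity * prevalence / p_positive
    p_healthy_given_negative = specificity * (1 - prevalence) / p_negative
    return p_sick_given_positive, p_healthy_given_negative

# Illustrative numbers only: a seemingly accurate test for a rare condition.
ppv, npv = predictive_values(sensitivity=0.95, specificity=0.90, prevalence=0.01)
print(f"P(Sick | Positive)    = {ppv:.4f}")
print(f"P(Healthy | Negative) = {npv:.4f}")
```

With these illustrative inputs, fewer than one in ten Positive results correspond to a Sick person, which is why the two pairs of questions can rank tests differently.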

Your 'proven' coverage probabilities won't stack up in the real world unless all your assumptions stand up.

You call me crazy, yet you pretend your assumptions are the work of a conservative, solid, assumption-free analysis.

But the interesting thing is that, once we decide on this form of evaluation, and once we choose our prior, we have an automatic 'recipe' to create an appropriate estimator.

If he wants an unbiased estimator for a complex model, he doesn't have any automated way to build a suitable estimator.

I don't have an automatic way to create an unbiased estimator, because I think bias is a bad way to evaluate an estimator.

But given the conditional-on-data estimation that I like, and the prior, I can connect the prior and the likelihood to give me the estimator.

We all have different ways to evaluate our methods, and we'll probably never agree on which methods are best.

And some 'frequentist' proofs might be fun too, predicting the performance under some presumed model of data generation.

Sometimes, you have great difficulty finding unbiased estimators, and even when you do you have a stupid estimator (for some really complex model) that will say the variance is negative.

ML: 'The lesson here is that, while we disagree a little on evaluation, none of us has a monopoly on how to create estimators that have the properties we want.'

## An Empirical Evaluation of Alternative Methods of Estimation for Confirmatory Factor Analysis With Ordinal Data

We used a comprehensive simulation study to empirically test our set of theoretically generated research hypotheses pertaining to the performance of CFA for ordinal data with polychoric correlations using both full WLS (as per Browne, 1984, and B.

First, as predicted, we replicated Quiroga's (1992) findings that polychoric correlations among ordinal variables accurately estimated the bivariate relations among normally distributed latent response variables and that modest violation of normality for latent response variables of a degree that might be expected in applied research leads to only slightly biased estimates of polychoric correlations.

As we described earlier, it has been analytically demonstrated that when CFA models are fitted using observed polychoric correlation matrices, full WLS estimation produces asymptotically correct chi-square tests of model fit and parameter standard errors (e.g., B.

Nonetheless, other studies applying full WLS estimation to the analysis of polychoric correlations have found rates of nonconvergence and improper solutions similar to those reported here: Neither Dolan (1994) nor Potthast (1993) obtained nonpositive definite weight matrices or improper solutions for any replications of their respective studies, which analyzed samples of size 200 and greater, whereas Babakus et al.

Thus, our findings are similar to those of Dolan (1994), who concluded that a sample size of 200 is not sufficient to estimate an eight-indicator model with full WLS using polychoric correlations, and to those of Potthast (1993), who reported significant problems when nine-parameter and larger models are estimated with a sample size as large as 1,000.

Our findings suggest that, for normally distributed latent response variables, parameter estimates obtained with full WLS estimation tended to be somewhat positively biased with overestimation increasing as a function of increasing model size and decreasing sample size.

However, these biases were relatively small across all cells of the simulation: Even when the 20-indicator model was estimated with N = 200, estimates of the population factor loading .70 were consistently less than .80, and estimates of the population factor correlation .30 were typically less than .40.

Dolan (1994) found that parameters tended to be slightly overestimated with N = 400 and less, whereas Potthast (1993) concluded that parameter estimate bias was trivial across all cells of her simulation study, which only had two conditions of sample size, N = 500 and N = 1,000.

As predicted, for both full WLS estimation and robust WLS estimation, we found that increasing levels of nonnormality in latent response variables was associated with greater positive bias in parameter estimates, echoing the tendency of polychoric correlations to be positively biased when observed ordinal data derives from nonnormal latent response variables.

Rather, as stated above, our intent was to evaluate the effect of violation of a crucial theoretical assumption for estimation of CFAs using polychoric correlations, namely the latent normality assumption for y*, and the manipulations we chose for our simulations were explicitly targeted to do so.

Because polychoric correlations provide robust estimates of the true correlation even when different sets of thresholds are applied to y* variables, it follows that estimation of CFA models is not substantially affected according to whether or not threshold sets are constant across indicators.

Second, to the extent that the observed ordinal variables have nonzero skewness and kurtosis (e.g., as a result of threshold sets that lead to a dramatically different distribution shape for y relative to a normal or moderately nonnormal y*), full WLS estimation is known to produce biased chi-square test statistics and parameter standard error estimates.

Potthast, 1993). Third, when the population y* variables are of extreme nonnormality (e.g., skewness = 5, kurtosis = 50), the likely result is that the observed ordinal variables themselves will also have exaggerated levels of skewness and kurtosis, thus again leading to low expected frequencies in observed contingency tables.

With regard to consideration of the joint effects of underlying nonnormality and varying thresholds across indicators, to the extent that these factors jointly produce observed contingency tables with low (or zero) expected cell frequencies, they are likely to lead to inaccurate polychoric correlations (as shown by Brown &

## Cross-validation (statistics)

Cross-validation, sometimes called rotation estimation,[1][2][3] or out-of-sample testing is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.

In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set).[4] The goal of cross-validation is to test the model’s ability to predict new data that were not used in estimating it, in order to flag problems like overfitting and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).

If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data.

If we use least squares to fit a function in the form of a hyperplane $y = a + \beta^{T}x$ to the data $(x_i, y_i)$, $1 \le i \le n$, we could then assess the fit using the mean squared error (MSE).

The MSE for given estimated parameter values a and β on the training set $(x_i, y_i)$, $1 \le i \le n$, is

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - a - \beta^{T}x_i\right)^{2}.$$

If the model is correctly specified, it can be shown under mild assumptions that the expected value of the MSE for the training set is (n − p − 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set[6] (the expected value is taken over the distribution of training sets).

Since in linear regression it is possible to directly compute the factor (n − p − 1)/(n + p + 1) by which the training MSE underestimates the validation MSE under the assumption that the model specification is valid, cross-validation can be used for checking whether the model has been overfitted, in which case the MSE in the validation set will substantially exceed its anticipated value.

In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two).
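A minimal sketch of the shuffle-and-split step follows, completed with the usual next step of training on d0 and validating on d1, then swapping the roles. The model (a one-variable least-squares line), the synthetic data, and the names `fit_line`, `mse` and `two_fold_cv` are assumptions made for illustration.

```python
import random

def fit_line(points):
    """Least-squares fit of y = a + b*x to (x, y) pairs; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((y - a - b * x) ** 2 for x, y in points) / len(points)

def two_fold_cv(points, seed=0):
    # Shuffle a copy of the data array, then split it in two equal halves.
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    d0, d1 = shuffled[:half], shuffled[half:]
    # Train on d0 and validate on d1, then swap, and average the two errors.
    return (mse(fit_line(d0), d1) + mse(fit_line(d1), d0)) / 2

# Synthetic data (assumed): a noisy straight line.
rng = random.Random(1)
data = [(float(x), 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)) for x in range(40)]

print("2-fold CV estimate of MSE:", two_fold_cv(data))
print("training MSE on all data: ", mse(fit_line(data), data))
```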

In the holdout method, we randomly assign data points to two sets d0 and d1, usually called the training set and the test set, respectively.

While the holdout method can be framed as 'the simplest kind of cross-validation',[8] many sources instead classify holdout as a type of simple validation, rather than a simple or degenerate form of cross-validation.[9][10]

Repeated random sub-sampling validation, also known as Monte Carlo cross-validation,[11] randomly splits the dataset into training and validation data.

When the value being predicted is continuously distributed, the mean squared error, root mean squared error or median absolute deviation could be used to summarize the errors.

Suppose we choose a measure of fit F, and use cross-validation to produce an estimate F* of the expected fit EF of a model to an independent data set drawn from the same population as the training data.

The variance of F* can be large.[13][14] For this reason, if two statistical procedures are compared based on the results of cross-validation, it is important to note that the procedure with the better estimated performance may not actually be the better of the two procedures (i.e.

In some cases such as least squares and kernel regression, cross-validation can be sped up significantly by pre-computing certain values that are needed repeatedly in the training, or by using fast 'updating rules' such as the Sherman–Morrison formula.

An extreme example of accelerating cross-validation occurs in linear regression, where the results of cross-validation have a closed-form expression known as the prediction residual error sum of squares (PRESS).
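As a hedged illustration of this shortcut, the sketch below computes PRESS for a one-variable least-squares line in two ways: from a single fit, by dividing each residual by one minus its leverage, and by brute-force leave-one-out refitting. The data and the function names are assumptions for illustration; the leverage expression used is the standard one for simple linear regression.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def press_closed_form(xs, ys):
    """PRESS from one fit: sum of (residual / (1 - leverage))^2."""
    n = len(xs)
    a, b = fit_simple_regression(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        residual = y - (a + b * x)
        total += (residual / (1.0 - leverage)) ** 2
    return total

def press_brute_force(xs, ys):
    """PRESS by refitting n times, each time leaving one point out."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_simple_regression(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total

# Illustrative data (assumed).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9, 7.4, 7.7, 9.1, 10.4]

print("closed form:", press_closed_form(xs, ys))
print("brute force:", press_brute_force(xs, ys))
```

The two values should agree up to floating-point rounding, while the closed form needs only one fit instead of n.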

For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population.

When this occurs, there may be an illusion that the system changes in external samples, whereas the reason is that the model has missed a critical predictor and/or included a confounded predictor.

New evidence is that cross-validation by itself is not very predictive of external validity, whereas a form of experimental validation known as swap sampling that does control for human bias can be much more predictive of external validity.[15] As defined by this large MAQC-II study across 30,000 models, swap sampling incorporates cross-validation in the sense that predictions are tested across independent training and validation samples.

When there is a mismatch in these models developed across these swapped training and validation samples as happens quite frequently, MAQC-II shows that this will be much more predictive of poor external predictive validity than traditional cross-validation.

In addition to placing too much faith in predictions that may vary across modelers and lead to poor external validity due to these confounding modeler effects, these are some other ways that cross-validation can be misused: Since the order of the data is important, cross-validation might be problematic for time-series models.

For example, suppose we are interested in optical character recognition, and we are considering using either support vector machines (SVM) or k nearest neighbors (KNN) to predict the true character from an image of a handwritten character.

If we simply compared the methods based on their in-sample error rates, the KNN method would likely appear to perform better, since it is more flexible and hence more prone to overfitting compared to the SVM method.

It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates.[19] It has also been used in a more conventional sense in meta-analysis to estimate the likely prediction error of meta-analysis results.[20]

## M-estimator

In statistics, M-estimators are a broad class of estimators, which are obtained as the minima of sums of functions of the data.

Least-squares estimators are a special case of M-estimators.

The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators.

The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

More generally, an M-estimator may be defined to be a zero of an estimating function.[1][2][3][4][5][6] This estimating function is often the derivative of another statistical function.

For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the score function.[7] In many applications, such M-estimators can be thought of as estimating characteristics of the population.

The method of least squares is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.

Another popular M-estimator is maximum-likelihood estimation.

For a family of probability density functions f parameterized by θ, a maximum likelihood estimator of θ is computed for each set of data by maximizing the likelihood function over the parameter space { θ }.

When the observations are independent and identically distributed, an ML-estimate $\hat{\theta}$ satisfies

$$\hat{\theta} = \arg\max_{\theta} \left( \prod_{i=1}^{n} f(x_i, \theta) \right)$$

or, equivalently,

$$\hat{\theta} = \arg\min_{\theta} \left( -\sum_{i=1}^{n} \log f(x_i, \theta) \right).$$

Maximum-likelihood estimators have optimal properties in the limit of infinitely many observations under rather general conditions, but may be biased and not the most efficient estimators for finite samples.

In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of

$$\sum_{i=1}^{n} \rho(x_i, \theta),$$

where ρ is a function with certain properties (see below).

The solutions are called M-estimators ('M' for 'maximum likelihood-type' (Huber, 1981, page 43)); other types of robust estimator include L-estimators, R-estimators and S-estimators.

Maximum likelihood estimators (MLE) are thus a special case of M-estimators.

With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).

The function ρ, or its derivative, ψ, can be chosen in such a way to provide the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, close to the assumed distribution.

M-estimators are solutions, θ, which minimize

$$\sum_{i=1}^{n} \rho(x_i, \theta).$$

This minimization can always be done directly.

Often it is simpler to differentiate with respect to θ and solve for the root of the derivative.

When this differentiation is possible, the M-estimator is said to be of ψ-type.

Otherwise, the M-estimator is said to be of ρ-type.

In most practical cases, the M-estimators are of ψ-type.

For positive integer r, let $(\mathcal{X}, \Sigma)$ and $(\Theta \subset \mathbb{R}^{r}, S)$ be measure spaces.

$\theta \in \Theta$ is a vector of parameters.

An M-estimator of ρ-type $T$ is defined through a measurable function $\rho : \mathcal{X} \times \Theta \rightarrow \mathbb{R}$.

It maps a probability distribution $F$ on $\mathcal{X}$ to the value $T(F) \in \Theta$ (if it exists) that minimizes $\int_{\mathcal{X}} \rho(x, \theta)\,dF(x)$:

$$T(F) := \arg\min_{\theta \in \Theta} \int_{\mathcal{X}} \rho(x, \theta)\,dF(x)$$

For example, for the maximum likelihood estimator, $\rho(x, \theta) = -\log(f(x, \theta))$, where $f(x, \theta) = \frac{\partial F(x, \theta)}{\partial x}$.

If $\rho$ is differentiable, the computation of $\widehat{\theta}$ is usually much easier.

An M-estimator of ψ-type T is defined through a measurable function $\psi : \mathcal{X} \times \Theta \rightarrow \mathbb{R}^{r}$.

It maps a probability distribution F on $\mathcal{X}$ to the value $T(F) \in \Theta$ (if it exists) that solves the vector equation:

$$\int_{\mathcal{X}} \psi(x, \theta)\,dF(x) = 0$$

For example, for the maximum likelihood estimator,

$$\psi(x, \theta) = \left( \frac{\partial \log(f(x, \theta))}{\partial \theta^{1}}, \dots, \frac{\partial \log(f(x, \theta))}{\partial \theta^{r}} \right)^{\mathrm{T}},$$

where $u^{\mathrm{T}}$ denotes the transpose of vector u and $f(x, \theta) = \frac{\partial F(x, \theta)}{\partial x}$.

Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to $\theta$, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is $\psi(x, \theta) = \nabla_{\theta} \rho(x, \theta)$.

The previous definitions can easily be extended to finite samples.

If the function ψ decreases to zero as $x \rightarrow \pm\infty$, the estimator is called redescending.

Such estimators have some additional desirable properties, such as complete rejection of gross outliers.

For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required.

It is possible to use standard function optimization algorithms, such as Newton-Raphson.

However, in most cases an iteratively re-weighted least squares fitting algorithm can be performed; this is typically the preferred method.

For some choices of ψ, specifically, redescending functions, the solution may not be unique.

The issue is particularly relevant in multivariate and regression problems.

Thus, some care is needed to ensure that good starting points are chosen.

Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
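The iterative approach described above can be sketched for a univariate location estimate. Several choices here are assumptions made for illustration rather than anything stated in the original: the Huber ψ function with tuning constant c = 1.345, a scale held fixed at the rescaled median absolute deviation, the sample data, and the function name `huber_location`.

```python
import statistics

def huber_location(xs, c=1.345, tol=1e-9, max_iter=100):
    """Location M-estimate with Huber's psi, computed by iteratively
    re-weighted least squares (a weighted mean with updated weights)."""
    # Robust starting point: the median.
    mu = statistics.median(xs)
    # Robust scale: median absolute deviation, rescaled by 1.4826 so that it
    # matches the standard deviation for normal data. Kept fixed throughout.
    scale = 1.4826 * statistics.median([abs(x - mu) for x in xs])
    if scale == 0:
        return mu
    for _ in range(max_iter):
        weights = []
        for x in xs:
            r = abs(x - mu) / scale
            # Huber weights: full weight near the centre, c/|r| further out.
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Illustrative sample (assumed) with one gross outlier.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0]
print("mean:          ", statistics.mean(sample))
print("median:        ", statistics.median(sample))
print("Huber estimate:", huber_location(sample))
```

The outlier pulls the mean far from the bulk of the data, whereas the Huber estimate stays close to the median because the outlier's weight shrinks in proportion to its distance.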

In computation of M-estimators, it is sometimes useful to rewrite the objective function so that the dimension of parameters is reduced.

The procedure is called “concentrating” or “profiling”.

Examples in which concentrating parameters increases computation speed include seemingly unrelated regressions (SUR) models.[8] Consider the following M-estimation problem:

$$(\hat{\beta}_{N}, \hat{\gamma}_{N}) := \arg\max_{\beta, \gamma} \sum_{i=1}^{N} q(w_i, \beta, \gamma)$$

Assuming differentiability of the function q, the M-estimator solves the first-order conditions:

$$\sum_{i=1}^{N} \nabla_{\beta}\, q(w_i, \beta, \gamma) = 0$$

$$\sum_{i=1}^{N} \nabla_{\gamma}\, q(w_i, \beta, \gamma) = 0$$

Now, if we can solve the second equation for γ in terms of $W := (w_1, \dots, w_N)$ and $\beta$, the second equation becomes:

$$\sum_{i=1}^{N} \nabla_{\gamma}\, q(w_i, \beta, g(W, \beta)) = 0$$

where g is some function to be found.

Now, we can rewrite the original objective function solely in terms of β by inserting the function g in place of $\gamma$.

As a result, there is a reduction in the number of parameters.

Whether this procedure can be done depends on particular problems at hand.

However, when it is possible, concentrating parameters can facilitate computation to a great degree.

For example, in estimating a SUR model of 6 equations with 5 explanatory variables in each equation by maximum likelihood, the number of parameters declines from 51 to 30.[8] Despite its appealing computational features, concentrating parameters is of limited use in deriving asymptotic properties of M-estimators.[9] The presence of W in each summand of the objective function makes it difficult to apply the law of large numbers and the central limit theorem.

It can be shown that M-estimators are asymptotically normally distributed.

As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used.

However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.

The influence function of an M-estimator of $\psi$-type is proportional to its defining $\psi$ function.

Let T be an M-estimator of ψ-type, and G be a probability distribution for which $T(G)$ is defined.

Its influence function IF is

$$\operatorname{IF}(x; T, G) = -\frac{\psi(x, T(G))}{\int \left[ \frac{\partial \psi(y, \theta)}{\partial \theta} \right]_{\theta = T(G)} f(y)\,dy},$$

assuming the density function $f(y)$ exists.

A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).

M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.

Let (X1, ..., Xn) be a set of independent, identically distributed random variables, with distribution F.

If we define

$$\rho(x, \theta) = \frac{(x - \theta)^{2}}{2},$$

we note that $\sum_{i=1}^{n} \rho(X_i, \theta)$ is minimized when θ is the mean of the Xs.

Thus the mean is an M-estimator of ρ-type, with this ρ function.

As this ρ function is continuously differentiable in θ, the mean is thus also an M-estimator of ψ-type for ψ(x, θ) = θ − x.

For the median estimation of (X1, ..., Xn), instead we can define the ρ function as

$$\rho(x, \theta) = |x - \theta|,$$

and similarly, $\sum_{i=1}^{n} \rho(X_i, \theta)$ is minimized when θ is the median of the Xs.

While this ρ function is not differentiable in θ, the ψ-type M-estimator, which is the subgradient of the ρ function with respect to θ, can be expressed as $\psi(x, \theta) = \operatorname{sgn}(\theta - x)$ for $x \neq \theta$ and $\psi(x, \theta) \in [-1, 1]$ for $x = \theta$.
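Both claims can be checked numerically by minimizing the summed ρ directly. The sketch below is an illustration only: the sample, the ternary-search helper `minimize_convex` (valid here because both objectives are convex in θ), and the other names are assumptions.

```python
import statistics

def minimize_convex(objective, lo, hi, iters=200):
    """Ternary search for a minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Illustrative sample (assumed); odd length so the median is unique.
xs = [1.0, 2.0, 4.0, 7.0, 21.0]

def sum_squared_rho(theta):
    # Sum of rho(x, theta) = (x - theta)^2 / 2 over the sample.
    return sum((x - theta) ** 2 / 2 for x in xs)

def sum_absolute_rho(theta):
    # Sum of rho(x, theta) = |x - theta| over the sample.
    return sum(abs(x - theta) for x in xs)

theta_sq = minimize_convex(sum_squared_rho, min(xs), max(xs))
theta_abs = minimize_convex(sum_absolute_rho, min(xs), max(xs))

print("squared rho minimizer: ", theta_sq, " mean:  ", statistics.mean(xs))
print("absolute rho minimizer:", theta_abs, " median:", statistics.median(xs))
```

Up to the numerical tolerance of the search, the first minimizer matches the sample mean and the second matches the sample median.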
