
When Bayes, Ockham, and Shannon come together to define machine learning

It is somewhat surprising that among all the high-flying buzzwords of machine learning, we don’t hear much about the one phrase that fuses some of the core concepts of statistical learning, information theory, and natural philosophy into a single three-word combination.

And you may be wondering what the heck that is… Let’s peel off the layers and see how useful it is. We start (not chronologically) with Reverend Thomas Bayes who, by the way, never published his idea about how to do statistical inference — his essay appeared only posthumously — but was later immortalized by the eponymous theorem.

In that essay, Bayes described — in a rather frequentist manner — the simple theorem concerning joint probability which gives rise to the calculation of inverse probability, i.e. Bayes’ theorem:

P(h ∣ D) = P(D ∣ h) · P(h) / P(D)

where h is a hypothesis and D is the data/evidence.

This essentially tells you that you update your belief (the prior probability P(h)) after seeing the data/evidence (through the likelihood P(D ∣ h)), and assign the updated degree of belief to the term posterior probability P(h ∣ D).

But in the context of machine learning, a hypothesis h can be thought of as any set of rules (or logic, or process) which, we believe, can give rise to the examples or training data we are given, and which we use to learn the hidden nature of this mysterious process.

It would take many a volume to describe the genius and strange life of Claude Shannon, who almost single-handedly laid the foundation of information theory and ushered us into the age of modern high-speed communication and information exchange.

His master’s thesis in electrical engineering has been called the most important MS thesis of the 20th century: in it, the 22-year-old Shannon showed how the logical algebra of 19th-century mathematician George Boole could be implemented using electronic circuits of relays and switches.

This most fundamental feature of digital computers’ design — the representation of “true” and “false” and “0” and “1” as open or closed switches, and the use of electronic logic gates to make decisions and to carry out arithmetic — can be traced back to the insights in Shannon’s thesis.

Shannon defined the quantity of information produced by a source — for example, the quantity in a message — by a formula similar to the equation that defines thermodynamic entropy in physics.
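For a source emitting symbols with probabilities p_i, that quantity is the entropy H = −Σ p_i log2 p_i bits per symbol. As a quick, hypothetical illustration (not from Shannon’s paper), the sketch below estimates it from symbol frequencies in a string:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Estimate H = -sum(p_i * log2(p_i)) from the symbol frequencies in a message."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("aaaaaaaa"))             # 0.0 bits/symbol: a perfectly predictable source
print(shannon_entropy("abababab"))             # 1.0 bit/symbol: two equally likely symbols
print(shannon_entropy("the quick brown fox"))  # higher: more surprise per symbol
```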

Sir Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances.” Bertrand Russell: “Whenever possible, substitute constructions out of known entities for inferences to unknown entities.” Need an example of what the length of a hypothesis means?

On the other hand, if you create a complex (and long) hypothesis, you may be able to fit your training data really well, but it may not actually be the right hypothesis, as it runs against the MAP principle of preferring a hypothesis with small entropy (i.e., a short description length).

What MDL shows is that if a representation of hypotheses is chosen so that the size of hypothesis h is −log2 P(h), and if a representation for exceptions (errors) is chosen so that the encoding length of D given h is equal to −log2 P(D|h), then the MDL principle produces MAP hypotheses.
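This follows by taking logarithms of the MAP objective; a short derivation of the equivalence (standard textbook reasoning, not quoted from the article) is:

```latex
\begin{aligned}
h_{\mathrm{MAP}} &= \arg\max_{h \in \mathcal{H}} P(D \mid h)\,P(h) \\
                 &= \arg\max_{h \in \mathcal{H}} \big[\log_2 P(D \mid h) + \log_2 P(h)\big] \\
                 &= \arg\min_{h \in \mathcal{H}} \big[-\log_2 P(D \mid h) - \log_2 P(h)\big].
\end{aligned}
```

That is, the MAP hypothesis is exactly the one minimizing the description length of the hypothesis, −log2 P(h), plus the description length of the data given the hypothesis, −log2 P(D|h), which is the MDL criterion.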

It short-circuits the (often) infinitely large hypothesis space and leads us towards a highly probable set of hypotheses, which we can optimally encode and within which we can work towards finding the set of MAP hypotheses.

It is a wonderful fact that such a simple set of mathematical manipulations over a basic identity of probability theory can result in such a profound and succinct description of the fundamental limitation and goal of supervised machine learning.

Bayesian inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law.

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a 'likelihood function' derived from a statistical model for the observed data.

The posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence):

P(H ∣ E) = P(E ∣ H) · P(H) / P(E)

where P(H ∣ E) is the posterior probability, P(H) the prior, P(E ∣ H) the likelihood, and P(E) the evidence.

Ian Hacking noted that traditional 'Dutch book' arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books.

Note that this is expressed in words as 'posterior is proportional to likelihood times prior', or sometimes as 'posterior = likelihood times prior, over evidence'.
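As a toy illustration of “posterior is proportional to likelihood times prior” (the two hypotheses and the 70%-heads bias below are assumptions made up for the example, not taken from the text), consider updating the belief that a coin is fair versus biased after a sequence of flips:

```python
# Posterior ∝ likelihood × prior, normalized by the evidence P(E).
priors = {"fair": 0.5, "biased": 0.5}       # P(H): assumed prior beliefs
heads_prob = {"fair": 0.5, "biased": 0.7}   # P(heads | H): assumed models

def update(priors, heads_prob, flips):
    """Return P(H | flips) for a sequence of 'H'/'T' outcomes."""
    posterior = dict(priors)
    for flip in flips:
        likelihood = {h: (p if flip == "H" else 1 - p) for h, p in heads_prob.items()}
        unnormalized = {h: likelihood[h] * posterior[h] for h in posterior}
        evidence = sum(unnormalized.values())            # P(E)
        posterior = {h: v / evidence for h, v in unnormalized.items()}
    return posterior

print(update(priors, heads_prob, "HHHHTH"))
```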

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point.

By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point.

For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are built using a Student's t-distribution.

This correctly estimates the variance, due to the fact that (1) the average of normally distributed random variables is also normally distributed;

(2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution.

In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.
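To ground the frequentist side of this comparison, here is a minimal sketch, with made-up sample values, of the Student's t prediction interval for one new observation; per point (2) above, the Bayesian posterior predictive interval under an uninformative prior takes the same t form.

```python
import numpy as np
from scipy import stats

# Hypothetical sample from a normal distribution with unknown mean and variance.
x = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2])
n, mean, s = len(x), x.mean(), x.std(ddof=1)

# 95% prediction interval for one new observation: mean +/- t * s * sqrt(1 + 1/n)
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s * np.sqrt(1 + 1 / n)
print(f"95% prediction interval: ({mean - half_width:.2f}, {mean + half_width:.2f})")
```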

In fact, if the prior distribution is a conjugate prior, and hence the prior and posterior distributions come from the same family, it can easily be seen that both prior and posterior predictive distributions also come from the same family of compound distributions.

The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
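A minimal sketch of that hyperparameter bookkeeping for one conjugate pair, a Beta prior with a binomial likelihood (the prior settings and counts are invented for illustration):

```python
# Beta(a, b) prior on a success probability p, with a binomial likelihood.
a, b = 2.0, 2.0                     # assumed prior hyperparameters
successes, failures = 7, 3          # assumed observed data

# Conjugate update: the posterior is Beta(a + successes, b + failures).
a_post, b_post = a + successes, b + failures

# Prior and posterior predictive distributions for the next single trial have the
# same (Beta-Bernoulli) form; only the hyperparameters differ.
print("prior predictive     P(next trial succeeds) =", a / (a + b))
print("posterior predictive P(next trial succeeds) =", a_post / (a_post + b_post))
```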

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

In the general formulation, each candidate model M has parameters θ with a prior p(θ ∣ α) governed by hyperparameters α, and the belief in each model is updated from the evidence E by

P(M ∣ E) = P(E ∣ M) · P(M) / P(E).

The factor P(E ∣ M) / P(E) measures the impact of the evidence on a model: belief in M increases when P(E ∣ M) > P(E), and is left unchanged when P(E ∣ M) = P(E).

For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob.

However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), Freedman's 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable.

To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ('frequentist statistics').[12]

The posterior predictive distribution of a new observation x̃, given the observed data X = (x_1, …, x_n) and hyperparameters α, is obtained by averaging the sampling model over the posterior:

p(x̃ ∣ X, α) = ∫ p(x̃ ∣ θ) p(θ ∣ X, α) dθ.

As an example of making a prediction, suppose an archaeologist is working at a site thought to date from the medieval period, between the 11th century and the 16th century, and finds fragments of pottery, some of which are glazed and some of which are decorated.

It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated.

The evidence from each fragment is whether it is glazed (G or Ḡ) and whether it is decorated (D or D̄). Letting the continuous variable C denote the century in which the site was inhabited, with prior density f_C(c), Bayes' theorem updates the belief after each fragment with evidence E = e:

f_C(c ∣ E = e) = P(E = e ∣ C = c) · f_C(c) / P(E = e),

where the evidence term is

P(E = e) = ∫_11^16 P(E = e ∣ C = c) f_C(c) dc.

By calculating the area under the relevant portion of the posterior density after 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about a 1% chance that it was inhabited during the 13th century, a 63% chance during the 14th century and a 36% chance during the 15th century.
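A rough numerical sketch of that kind of calculation follows; the uniform prior, the linear interpolation of the glazed/decorated percentages across the period, and the simulated fragments are all assumptions made for illustration, so its output will only roughly resemble the percentages quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Century of habitation C, discretized on [11, 16], with a uniform prior f_C(c).
c = np.linspace(11, 16, 501)
prior = np.ones_like(c) / len(c)

# Assumed linear trends: glazed 1% -> 81%, decorated 50% -> 5% across the period.
p_glazed = 0.01 + (0.81 - 0.01) * (c - 11) / 5
p_decor = 0.50 + (0.05 - 0.50) * (c - 11) / 5

# Simulate 50 fragments from a hypothetical "true" date of c = 14.5.
true = 14.5
g_true = 0.01 + (0.81 - 0.01) * (true - 11) / 5
d_true = 0.50 + (0.05 - 0.50) * (true - 11) / 5
fragments = [(rng.random() < g_true, rng.random() < d_true) for _ in range(50)]

posterior = prior.copy()
for glazed, decorated in fragments:
    likelihood = (p_glazed if glazed else 1 - p_glazed) * (p_decor if decorated else 1 - p_decor)
    posterior *= likelihood
    posterior /= posterior.sum()          # normalize by the evidence P(E = e)

for century in range(11, 16):
    mass = posterior[(c >= century) & (c < century + 1)].sum()
    print(f"P(inhabited in [{century}, {century + 1})) = {mass:.2f}")
```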

Note that the Bernstein-von Mises theorem asserts here the asymptotic convergence to the 'true' distribution because the probability space corresponding to the discrete set of events {GD, GD̄, ḠD, ḠD̄} is finite.

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.[15][16][17]

There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like the Gibbs sampling and other Metropolis–Hastings algorithm schemes.[22]
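To make that connection concrete, here is a minimal random-walk Metropolis sketch (a generic example, not tied to any particular model discussed above) that draws samples from a one-dimensional unnormalized posterior:

```python
import numpy as np

def metropolis(log_unnorm_post, n_samples=10_000, step=0.5, x0=0.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-posterior."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, logp = x0, log_unnorm_post(x0)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        logp_prop = log_unnorm_post(proposal)
        if np.log(rng.random()) < logp_prop - logp:   # accept with prob min(1, ratio)
            x, logp = proposal, logp_prop
        samples[i] = x
    return samples

# Example: posterior of a normal mean with a standard normal prior (conjugate, so
# the exact answer is known and serves as a correctness check).
data = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
log_post = lambda mu: -0.5 * mu**2 - 0.5 * np.sum((data - mu) ** 2)
draws = metropolis(log_post)
print(draws[1000:].mean())   # should be close to sum(data) / (len(data) + 1)
```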

Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p.

Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.[24][25]

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.[26][27][28]

The Court of Appeal upheld the conviction, but it also gave the opinion that 'To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task.'

It has been argued that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value).

According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

The problem considered by Bayes in Proposition 9 of his essay, 'An Essay towards solving a Problem in the Doctrine of Chances', is the posterior distribution for the parameter a (the success rate) of the binomial distribution.

However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence.[37]

Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called 'inverse probability' (because it infers backwards from observations to parameters, or from effects to causes[38]).

In the subjective or 'informative' current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.[40]

The Bayesian Trap

Bayes' theorem explained with examples and implications for life.

Probability Theory - The Math of Intelligence #6

We'll build a Spam Detector using a machine learning model called a Naive Bayes Classifier! This is our first real dip into probability theory in the series.

17. Bayesian Statistics

MIT 18.650 Statistics for Applications, Fall 2016. Instructor: Philippe Rigollet.

21. Bayesian Statistical Inference I

MIT 6.041 Probabilistic Systems Analysis and Applied Probability, Fall 2010. Instructor: John Tsitsiklis.

Lecture 4: Conditional Probability | Statistics 110

We introduce conditional probability, independence of events, and Bayes' rule.

What Is The Bayesian Theory?

Bayes' theorem is a way of finding a probability when we know certain other probabilities.

Introducing Bayes factors and marginal likelihoods

Provides an introduction to Bayes factors, which are often used to do model comparison. In using Bayes factors, it is necessary to calculate the marginal likelihood of the data under each model.

Prior And Posterior - Intro to Statistics

This video is part of an online course, Intro to Statistics.

Birthday probability problem | Probability and Statistics | Khan Academy

The probability that at least 2 people in a room of 30 share the same birthday.

Conditional probability and combinations | Probability and Statistics | Khan Academy

Probability that I picked a fair coin given that I flipped 4 out of 6 heads.