
Boosting (machine learning)

Boosting is an ensemble meta-algorithm in machine learning used primarily to reduce bias, and also variance,[1] in supervised learning, and a family of machine learning algorithms that convert weak learners into strong ones.[2] Boosting is based on the question posed by Kearns and Valiant (1988, 1989):[3][4] Can a set of weak learners create a single strong learner?

Robert Schapire's affirmative answer in a 1990 paper[5] to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.[6] When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner.

After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost).
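
As an illustration of this reweighting step, here is a minimal AdaBoost-style sketch in Python; the exponential update and the variable names are the standard AdaBoost form, used as an assumed example rather than a quotation from the source.

import numpy as np

def reweight(weights, y_true, y_pred, alpha):
    # AdaBoost-style update: misclassified examples (y_true != y_pred) gain
    # weight, correctly classified ones lose weight, then renormalize.
    weights = weights * np.exp(-alpha * y_true * y_pred)   # labels in {-1, +1}
    return weights / weights.sum()

w = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
pred = np.array([1, -1, -1, -1])            # third example is misclassified
print(reweight(w, y, pred, alpha=0.5))      # its weight grows, the rest shrink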

Other algorithms that are similar in spirit to boosting algorithms are sometimes called 'leveraging algorithms', although they are also sometimes incorrectly called boosting algorithms.[9] The main variation between many boosting algorithms is their method of weighting training data points and hypotheses.

The recognition of object categories in images is a challenging problem in computer vision, especially when the number of categories is large; it is usually tackled with supervised learning, although research has shown that object categories and their locations in images can be discovered in an unsupervised manner as well.[10]

Background clutter and partial occlusion add difficulties to recognition as well.[11] Humans are able to recognize thousands of object types, whereas most existing object recognition systems are trained to recognize only a few, e.g., human faces, cars, or other simple objects.[12] Research has been very active on handling more categories and enabling incremental addition of new categories, and although the general problem remains unsolved, several multi-category object detectors (for up to hundreds or thousands of categories[13]) have been developed.

Face detection with the Viola–Jones framework is a classic example of binary categorization with boosting: a classifier built from a modest number of boosted features can reach a high detection rate at a very low false positive rate.[14] Another application of boosting for binary categorization is a system that detects pedestrians using patterns of motion and appearance.[15] This work was the first to combine both motion information and appearance information as features to detect a walking person.

This can be done by converting multi-class classification into a binary one (a set of categories versus the rest),[16] or by introducing a penalty error from the categories that do not have the features of the classifier.[17] In the paper 'Sharing visual features for multiclass and multiview object detection', A. Torralba et al. apply GentleBoost with features shared across object classes.

Also, for a given performance level, the total number of features required (and therefore the run-time cost of the classifier) for the feature-sharing detectors is observed to scale approximately logarithmically with the number of classes, i.e., more slowly than the linear growth seen in the non-sharing case.
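
For the "one category versus the rest" reduction mentioned above, a minimal scikit-learn sketch might look as follows; the dataset and parameter choices are illustrative assumptions, not taken from the source.

from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)                      # ten digit classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One boosted binary detector per class: "this category versus the rest".
ovr = OneVsRestClassifier(AdaBoostClassifier(n_estimators=50, random_state=0))
ovr.fit(X_tr, y_tr)
print("held-out accuracy:", ovr.score(X_te, y_te))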

AdaBoost

Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier.[1][2] When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm, such that later trees tend to focus on harder-to-classify examples.
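
A short sketch of this out-of-the-box usage with scikit-learn, boosting depth-1 decision trees (decision stumps); the data and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Depth-1 trees (decision stumps) are the classic weak learners for AdaBoost.
# On scikit-learn < 1.2 the parameter is named base_estimator instead.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())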

Problems in machine learning often suffer from the curse of dimensionality: each sample may consist of a huge number of potential features (for instance, there can be 162,336 Haar features, as used by the Viola–Jones object detection framework, in a 24×24 pixel image window), and evaluating every feature can not only slow classifier training and execution but in fact reduce predictive power, per the Hughes effect.[3] Unlike neural networks and SVMs, the AdaBoost training process selects only those features known to improve the predictive power of the model, reducing dimensionality and potentially improving execution time, as irrelevant features need not be computed.
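
This feature-selection effect can be observed directly from a fitted stump-based ensemble, as in the hedged sketch below (synthetic data and parameter values are assumptions for illustration).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 candidate features, only 5 of which carry signal.
X, y = make_classification(n_samples=500, n_features=100, n_informative=5,
                           random_state=0)

clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0).fit(X, y)

# Stump-based boosting only ever evaluates the features its weak learners
# split on, so unused features never need to be computed at prediction time.
used = np.flatnonzero(clf.feature_importances_)
print(f"{used.size} of {X.shape[1]} features used")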

For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification.

These weights can be used to inform the training of the weak learner; for instance, decision trees can be grown that favor splitting sets of samples with high weights.
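
A minimal sketch of passing the boosting weights to a weak tree learner; the synthetic data and the specific weight values are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Boosting passes the current example weights to the weak learner, so the
# tree's split criterion favours heavily weighted ("hard") samples.
weights = np.where(rng.random(200) < 0.1, 5.0, 1.0)   # pretend 10% are hard
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=weights)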

Suppose the boosted classifier built from the first m − 1 weak classifiers k_1, …, k_{m−1} (with coefficients α_1, …, α_{m−1}) is C_{m−1}. At the m-th iteration we want to extend this to a better boosted classifier by adding a multiple of one of the weak classifiers:

{\displaystyle C_{m}(x_{i})=C_{m-1}(x_{i})+\alpha _{m}k_{m}(x_{i})}

So it remains to determine which weak classifier k_m is the best choice, and what its weight α_m should be. Writing ε_m for the weighted error rate of the chosen weak classifier under the current example weights, the weight that minimizes the total exponential error turns out to be

{\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)}
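
To make the update above concrete, here is a tiny numeric sketch; the toy numbers and variable names are illustrative assumptions, not taken from the source.

import numpy as np

# One toy boosting round: five samples, their current weights, and the
# predictions of one candidate weak classifier (labels in {-1, +1}).
w = np.array([0.1, 0.3, 0.2, 0.2, 0.2])
y = np.array([1, 1, -1, -1, 1])
k = np.array([1, -1, -1, -1, 1])

eps = w[k != y].sum() / w.sum()          # weighted error rate: 0.3
alpha = 0.5 * np.log((1 - eps) / eps)    # coefficient: about 0.42
print(eps, alpha)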

As long as the loss function is monotonic and continuously differentiable, the classifier is always driven toward purer solutions.[5] Zhang (2004) provides a loss function based on least squares, a modified Huber loss function:

{\displaystyle \phi (y,f(x))={\begin{cases}-4yf(x)&{\text{if }}yf(x)<-1\\(yf(x)-1)^{2}&{\text{if }}-1\leq yf(x)\leq 1\\0&{\text{if }}yf(x)>1\end{cases}}}

This function is more well-behaved than LogitBoost for f(x) close to 1 or −1, does not penalise 'overconfident' predictions (y f(x) > 1), unlike unmodified least squares, and only penalises samples misclassified with confidence greater than 1 linearly, as opposed to quadratically or exponentially, and is thus less susceptible to the effects of outliers.
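
The piecewise form above can be written as a short function; this is a minimal sketch assuming margins z = y·f(x) with labels in {−1, +1}, and the names are illustrative.

import numpy as np

def modified_huber_loss(y, f):
    # Margin z = y * f(x): linear penalty for badly misclassified samples
    # (z < -1), quadratic near the boundary, zero for confident correct
    # predictions (z > 1).
    z = y * f
    return np.where(z < -1, -4.0 * z, np.maximum(0.0, 1.0 - z) ** 2)

margins = np.array([-3.0, -0.5, 0.5, 2.0])
print(modified_huber_loss(np.ones(4), margins))   # [12.   2.25  0.25  0.  ]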

At round t, each sample i carries a weight w_i proportional to e^{−y_i f(x_i)}, where f is the boosted classifier built so far, and the coefficient α_t of the new weak hypothesis, whose output on sample i is written h_i, is chosen to minimize the weighted exponential error Σ_i w_i e^{−y_i h_i α_t}. Since y_i h_i ∈ [−1, 1], convexity of the exponential gives, with ε_t collecting the weighted error of the weak hypothesis,

{\displaystyle {\begin{aligned}\sum _{i}w_{i}e^{-y_{i}h_{i}\alpha _{t}}\leq \sum _{i}\left({\frac {1-y_{i}h_{i}}{2}}\right)w_{i}e^{\alpha _{t}}+\sum _{i}\left({\frac {1+y_{i}h_{i}}{2}}\right)w_{i}e^{-\alpha _{t}}\\=\left({\frac {\epsilon _{t}}{2}}\right)e^{\alpha _{t}}+\left({\frac {1-\epsilon _{t}}{2}}\right)e^{-\alpha _{t}}\end{aligned}}}

This bound is minimized where its derivative with respect to α_t vanishes; setting the derivative to zero and solving gives

{\displaystyle {\begin{aligned}\left({\frac {\epsilon _{t}}{2}}\right)e^{\alpha _{t}}-\left({\frac {1-\epsilon _{t}}{2}}\right)e^{-\alpha _{t}}=0\\\alpha _{t}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{t}}{\epsilon _{t}}}\right)\end{aligned}}}
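
As a sanity check on the closed form above, the following sketch compares the numeric minimizer of the bound with ½ ln((1 − ε_t)/ε_t); the particular value of ε_t is an assumption for illustration.

import numpy as np

eps = 0.2                                    # assumed weighted error for the round
alphas = np.linspace(0.01, 3.0, 1000)
bound = (eps / 2) * np.exp(alphas) + ((1 - eps) / 2) * np.exp(-alphas)

numeric_min = alphas[bound.argmin()]
closed_form = 0.5 * np.log((1 - eps) / eps)  # about 0.693 for eps = 0.2
print(numeric_min, closed_form)              # the two agree up to grid spacing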

Each round of the algorithm adds a new term to the previous model, F_{t−1}(x) + f_t(p(x)), with f_t typically chosen using weighted least-squares error. Thus, rather than multiplying the output of the entire tree by some fixed value, each leaf node is changed to output half the logit transform of its previous value.
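
A minimal sketch of the leaf transform just described: each leaf's stored probability estimate is replaced by half its logit. The clipping constant is an illustrative assumption added to avoid infinities; nothing else is implied about the original implementation.

import numpy as np

def half_logit(p, eps=1e-6):
    # Replace a leaf's stored probability estimate p with half its logit;
    # the clipping constant only guards against infinities at p = 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    return 0.5 * np.log(p / (1 - p))

print(half_logit(np.array([0.1, 0.5, 0.9])))   # roughly [-1.1, 0.0, 1.1]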

A technique for speeding up processing of boosted classifiers, early termination refers to testing each potential object with only as many layers of the final classifier as are necessary to meet some confidence threshold, speeding up computation for cases where the class of the object can easily be determined.

One such scheme is the object detection framework introduced by Viola and Jones:[9] in an application with significantly more negative samples than positive, a cascade of separate boost classifiers is trained, the output of each stage biased such that some acceptably small fraction of positive samples is mislabeled as negative, and all samples marked as negative after each stage are discarded.
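
A hedged sketch of such a cascade at prediction time, with hypothetical stage scorers and thresholds standing in for the trained boosted stages:

def cascade_predict(x, stages, thresholds):
    # Evaluate boosted stages in order and reject as soon as one stage's
    # score falls below its threshold, so most negatives exit early.
    for stage, threshold in zip(stages, thresholds):
        if stage(x) < threshold:
            return -1            # rejected; later stages are never evaluated
    return +1                    # survived every stage; accepted as positive

# Toy usage with trivial stand-in stages.
stages = [lambda x: x, lambda x: 2 * x]
print(cascade_predict(0.3, stages, thresholds=[0.0, 1.0]))   # -1, rejected at stage 2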

This method has since been generalized, with a formula provided for choosing optimal thresholds at each stage to achieve some desired false positive and false negative rate.[10] In the field of statistics, where AdaBoost is more commonly applied to problems of moderate dimensionality, early stopping is used as a strategy to reduce overfitting.[11] A validation set of samples is separated from the training set, performance of the classifier on the samples used for training is compared to performance on the validation samples, and training is terminated if performance on the validation sample is seen to decrease even as performance on the training set continues to improve.
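
One way to realize this early-stopping strategy with off-the-shelf tools is sketched below using scikit-learn's staged scoring; the dataset and split are assumptions for illustration, and the source describes the general strategy rather than this particular API.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Score the partial ensemble after each boosting round on the held-out split
# and keep the round count where validation performance peaks.
val_scores = list(clf.staged_score(X_val, y_val))
print("stop after", int(np.argmax(val_scores)) + 1, "rounds")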

When the weak classifier and its coefficient are chosen greedily at each layer t to minimize test error, the next layer added is said to be maximally independent of layer t:[12] it is unlikely that a weak learner similar to learner t will be chosen at step t + 1.

The simplest methods, which can be particularly effective in conjunction with totally corrective training, are weight- or margin-trimming: when the coefficient, or the contribution to the total test error, of some weak classifier falls below a certain threshold, that classifier is dropped.
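
A minimal sketch of weight-trimming as described above, keeping only weak classifiers whose coefficients exceed a threshold; the threshold value and names are illustrative assumptions.

import numpy as np

def trim_by_weight(alphas, learners, threshold=0.01):
    # Drop weak classifiers whose ensemble coefficient is below the threshold.
    return [(a, h) for a, h in zip(alphas, learners) if a >= threshold]

alphas = np.array([0.8, 0.005, 0.3, 0.002])
learners = ["h1", "h2", "h3", "h4"]          # stand-ins for fitted weak models
print(trim_by_weight(alphas, learners))      # keeps only h1 and h3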

17. Learning: Boosting

MIT 6.034 Artificial Intelligence, Fall 2010 View the complete course: Instructor: Patrick Winston Can multiple weak classifiers be used to make a strong one? ..

13. Classification

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016 View the complete course: Instructor: John Guttag Prof. Guttag introduces supervised..

Predicting the Winning Team with Machine Learning

Can we predict the outcome of a football game given a dataset of past games? That's the question that we'll answer in this episode by using the scikit-learn machine learning library as our...

Lecture 16 | Adversarial Examples and Adversarial Training

In Lecture 16, guest lecturer Ian Goodfellow discusses adversarial examples in deep learning. We discuss why deep networks and other machine learning models are susceptible to adversarial examples,...

Viola Jones face detection and tracking explained

Viola Jones face detection algorithm and tracking is explained. Includes explanation of haar features, integral images, adaboost, cascading classifiers, mean shift tracking and Camshift tracking....

AdaBoost

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the Gödel Prize in 2003 for their work. It can be used in...


Robust Face Recognition Using Recognition-by-Parts, Boosting, and Transduction

One of the main challenges for computational intelligence is to understand how people detect and categorize objects in general, and process and recognize each other's face, in particular. The...

AdaBoost 2D [MATLAB Code Demo]

AdaBoost 2D [MATLAB Code Demo] Description: This video shows a MATLAB program that performs the classification of two different classes using the AdaBoost algorithm. In this demo, the samples...
