# AI News, Machine Learning FAQ

## Machine Learning FAQ

It’s a combinatorial search problem: at each split, we want to find the features that give us “the best bang for the buck” (maximizing information gain).

If we choose a “brute force” approach, our computational complexity is O(m^2), where m is the number of features in our training set, and O(n^2) in the number n of training cases (I think it can be O(n log(n)) if you are lucky).
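One way an n log n cost can arise for a single numeric feature is to sort the values once and then sweep the candidate thresholds while updating class counts incrementally. The sketch below assumes a binary split on one numeric feature scored by entropy-based information gain; `entropy_from_counts` and `best_threshold` are illustrative names, not functions from any library.

```python
from collections import Counter
from math import log2

def entropy_from_counts(counts, total):
    """Shannon entropy (in bits) of a table of class counts."""
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split on one feature.

    The sort costs O(n log n); the sweep then touches each sample once.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    parent = entropy_from_counts(right, n)
    best = (None, 0.0)
    for rank in range(n - 1):
        i = order[rank]
        left[labels[i]] += 1      # move sample i from the right side to the left
        right[labels[i]] -= 1
        nxt = values[order[rank + 1]]
        if values[i] == nxt:
            continue              # no threshold fits between equal values
        n_left, n_right = rank + 1, n - rank - 1
        child = (n_left / n) * entropy_from_counts(left, n_left) \
            + (n_right / n) * entropy_from_counts(right, n_right)
        gain = parent - child
        if gain > best[1]:
            best = ((values[i] + nxt) / 2, gain)
    return best

# Toy data (hypothetical): the classes separate cleanly between 2 and 3.
threshold, gain = best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"])
```

Repeating the sweep for each of the m features gives the per-node cost of an exhaustive single-feature search under these assumptions.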

## Decision tree learning

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).

Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables.

The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target or output feature or the arc leads to a subordinate decision node on a different input feature.

This process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data.

In data mining, decision trees can be described also as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.

Decision trees used in data mining are of two main types: classification trees, where the predicted outcome is a discrete class, and regression trees, where the predicted outcome is a real number. The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al. Trees used for regression and trees used for classification have some similarities, but also some differences, such as the procedure used to determine where to split.

Some techniques, often called ensemble methods, construct more than one decision tree.

A special case of a decision tree is a decision list, which is a one-sided decision tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node).

While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, and they permit non-greedy learning methods and monotonic constraints to be imposed.

Decision tree learning is the construction of a decision tree from class-labeled training tuples.

A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label.

Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring 'best'.

Used by the CART (classification and regression tree) algorithm for classification trees, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

In these impurity measures, $p_1, p_2, \ldots$ are fractions that add up to 1 and represent the percentage of each class present in the child node that results from a split in the tree. Information gain is used to decide which feature to split on at each step in building the tree.
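The two split criteria described above can be sketched in a few lines. This is a minimal illustration that assumes the usual definitions (Gini impurity as one minus the sum of squared class fractions, information gain as parent entropy minus the size-weighted entropy of the children); the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def class_fractions(labels):
    """Fractions p_1, p_2, ... of each class; they add up to 1."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """Chance of mislabeling a random element drawn and labeled from the same distribution."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    return sum(-p * log2(p) for p in class_fractions(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node (hypothetical data): 5 "yes" and 5 "no", split into two children.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(gini_impurity(parent), entropy(parent), information_gain(parent, children))
```

A pure node scores zero under both measures, which is why a greedy learner prefers splits whose children are as close to pure as possible.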

For each node of the tree, the information value 'represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node'. Consider an example data set with four attributes: outlook (sunny, overcast, rainy), temperature (hot, mild, cool), humidity (high, normal), and windy (true, false), with a binary (yes or no) target variable, play, and 14 data points.

This example is adapted from the example appearing in Witten et al.

Introduced in CART and efficiently used in Decision stream, variance reduction is often employed in cases where the target variable is continuous (regression tree), meaning that use of many other metrics would first require discretization before being applied.

In the variance reduction of a split, $S$, $S_t$, and $S_f$ are the set of presplit sample indices, the set of sample indices for which the split test is true, and the set of sample indices for which the split test is false, respectively.
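As a rough illustration, variance reduction can be computed as the variance of the targets before the split minus the size-weighted variances of the two sides. This is one common form and an assumption here, since the exact formula is not reproduced in the text; `variance_reduction` and its arguments are illustrative names.

```python
def variance(ys):
    """Population variance of a list of target values."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(y, test):
    """Variance reduction of a split.

    y    -- target values for the presplit samples (indices in S)
    test -- booleans, True where the split test holds (S_t), False otherwise (S_f)
    """
    s_t = [yi for yi, passed in zip(y, test) if passed]
    s_f = [yi for yi, passed in zip(y, test) if not passed]
    if not s_t or not s_f:
        return 0.0  # a split that sends everything one way reduces nothing
    n = len(y)
    return variance(y) - len(s_t) / n * variance(s_t) - len(s_f) / n * variance(s_f)

# Toy targets (hypothetical): the split separates low from high values perfectly.
reduction = variance_reduction([1.0, 1.0, 5.0, 5.0], [True, True, False, False])
```

Because the criterion works directly on continuous targets, no discretization step is needed before scoring a candidate split.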

Amongst other data mining methods, decision trees have various advantages.

In a decision tree, all paths from the root node to the leaf node proceed by way of conjunction, or AND. In a decision graph, it is possible to use disjunctions (ORs) to join two or more paths together using minimum message length (MML). Decision graphs have been further extended to allow for previously unstated new attributes to be learnt dynamically and used at different places within the graph. The more general coding scheme results in better predictive accuracy and log-loss probabilistic scoring. In general, decision graphs infer models with fewer leaves than decision trees.

Evolutionary algorithms have been used to avoid locally optimal decisions and search the decision tree space with little a priori bias. It is also possible for a tree to be sampled using MCMC. The tree can be searched for in a bottom-up fashion.

Many data mining software packages provide implementations of one or more decision tree algorithms. Several examples include Salford Systems CART (which licensed the proprietary code of the original CART authors), IBM SPSS Modeler, RapidMiner, SAS Enterprise Miner, Matlab, R (an open source software environment for statistical computing which includes several CART implementations such as the rpart, party and randomForest packages), Weka (a free and open-source data mining suite that contains many decision tree algorithms), Orange, KNIME, Microsoft SQL Server, and scikit-learn (a free and open-source machine learning library for the Python programming language).

## Random forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

The first algorithm for random decision forests was created by Tin Kam Ho using the random subspace method, which, in Ho's formulation, is a way to implement the 'stochastic discrimination' approach to classification proposed by Eugene Kleinberg.

An extension of the algorithm was developed by Leo Breiman and Adele Cutler, and 'Random Forests' is their trademark. The extension combines Breiman's 'bagging' idea and random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.

The general method of random decision forests was first proposed by Ho in 1995.

Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions.

A subsequent work along the same lines concluded that other splitting methods, as long as they are randomly forced to be insensitive to some feature dimensions, behave similarly.

Note that this observation of a more complex classifier (a larger forest) getting more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting.

The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree.

The idea of random subspace selection from Ho was also influential in the design of random forests.

In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node.

Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Dietterich.

The introduction of random forests proper was first made in a paper by Leo Breiman. This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging.

In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests. The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

Decision trees are a popular method for various machine learning tasks.

Tree learning 'come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining', say Hastie et al., 'because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate'. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance.

Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners.

Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples. For $b = 1, \ldots, B$: sample, with replacement, $n$ training examples from $X, Y$ (call these $X_b, Y_b$), then train a classification or regression tree $f_b$ on $X_b, Y_b$.

After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

or by taking the majority vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on $x'$:

$$\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left(f_b(x') - \hat{f}\right)^2}{B - 1}}$$

The number of samples/trees, $B$, is a free parameter.

Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set.

An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.
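The bagging procedure and the out-of-bag error can be sketched as follows. To stay self-contained, the sketch swaps in a one-split regression stump on a single numeric feature as a stand-in for a full tree learner; `fit_stump`, `bagging`, `predict` and `oob_error` are illustrative names, and the step-shaped toy data is hypothetical.

```python
import random
from statistics import mean, stdev

def fit_stump(xs, ys):
    """Fit a one-split regression stump (a stand-in for a full regression tree)."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # candidate splits of the form x <= thr
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:                          # all x identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr

def bagging(xs, ys, B, rng):
    """Fit B learners, each on a bootstrap sample drawn with replacement."""
    n = len(xs)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        f = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        models.append((f, set(idx)))          # remember which samples the learner saw
    return models

def predict(models, x):
    """Mean prediction and its standard deviation across learners (needs B >= 2)."""
    preds = [f(x) for f, _ in models]
    return mean(preds), stdev(preds)

def oob_error(models, xs, ys):
    """Mean squared error on each sample, using only learners that did not see it."""
    errs = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        preds = [f(x) for f, seen in models if i not in seen]
        if preds:
            errs.append((mean(preds) - y) ** 2)
    return mean(errs)

xs = [i / 10 for i in range(20)]
ys = [0.0 if x < 1.0 else 1.0 for x in xs]
models = bagging(xs, ys, B=50, rng=random.Random(0))
low, _ = predict(models, 0.2)
high, spread = predict(models, 1.8)
oob = oob_error(models, xs, ys)
```

Recomputing `oob` for increasing values of `B` is one way to watch the error level off, under the assumptions of this toy setup.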

The above procedure describes the original bagging algorithm for trees.

Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.

This process is sometimes called 'feature bagging'.

The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
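The effect of feature bagging on a dominant predictor can be illustrated with a small simulation. The sketch assumes each feature already has a split-quality score (a hypothetical input standing in for information gain or Gini decrease) and leaves the subset size `k` as a parameter; `pick_split_feature` is an illustrative name.

```python
import random

def pick_split_feature(scores, k, rng):
    """Choose the best-scoring feature among a random subset of k features.

    scores -- per-feature split quality at this node, higher is better (hypothetical)
    Returns (chosen_feature_index, sampled_subset).
    """
    subset = rng.sample(range(len(scores)), k)
    chosen = max(subset, key=lambda j: scores[j])
    return chosen, subset

rng = random.Random(0)
scores = [0.9, 0.3, 0.25, 0.2, 0.2, 0.15, 0.1, 0.1, 0.05]  # feature 0 dominates
draws = [pick_split_feature(scores, k=3, rng=rng) for _ in range(200)]
share_dominant = sum(chosen == 0 for chosen, _ in draws) / len(draws)
```

Without subsampling, feature 0 would win every split; with `k = 3` of 9 features it is only available about a third of the time, so other features get to shape some of the trees.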

An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.

Typically, for a classification problem with p features, √p (rounded down) features are used in each split. For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.

Adding one further step of randomization yields extremely randomized trees, or ExtraTrees.

These are trained using bagging and the random subspace method, like in an ordinary random forest, but additionally the top-down splitting in the tree learner is randomized.

Instead of computing the locally optimal feature/split combination (based on, e.g., information gain or the Gini impurity), for each feature under consideration, a random value is selected for the split.

This value is selected from the feature's empirical range (in the tree's training set, i.e., the bootstrap sample). Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way.

The following technique was described in Breiman's original paper and is implemented in the R package randomForest.

The first step in measuring the variable importance in a data set $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ is to fit a random forest to the data.

During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the $j$-th feature after training, the values of the $j$-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set.

The importance score for the $j$-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees.

The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values.
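The permutation idea can be sketched independently of any particular forest implementation. The example below makes several simplifying assumptions: a fixed, hypothetical `model` function stands in for the fitted forest, the error is the mean squared error on one data set rather than the per-tree out-of-bag error, and the normalization by the standard deviation of the differences is omitted.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions on rows X with targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, rng, repeats=20):
    """Average increase in error after shuffling the values of feature j."""
    baseline = mse(model, X, y)
    diffs = []
    for _ in range(repeats):
        column = [row[j] for row in X]
        rng.shuffle(column)                   # break the link between feature j and y
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        diffs.append(mse(model, permuted, y) - baseline)
    return sum(diffs) / repeats

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(50)]
y = [3 * row[0] for row in X]                 # only feature 0 carries signal
model = lambda row: 3 * row[0]                # hypothetical stand-in for a fitted forest
importance_0 = permutation_importance(model, X, y, 0, rng)
importance_1 = permutation_importance(model, X, y, 1, rng)
```

Shuffling the informative feature raises the error, while shuffling the unused one leaves it unchanged, which is the ranking signal the procedure relies on.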

The statistical definition of the variable importance measure was given and analyzed by Zhu et al. This method of determining variable importance has some drawbacks.

For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels.

Methods such as partial permutations and growing unbiased trees can be used to solve the problem.

If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002. It turns out that both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set $\{(x_i, y_i)\}_{i=1}^{n}$ that make predictions $\hat{y}$ for new points $x'$ by looking at the 'neighborhood' of the point, formalized by a weight function $W$:

$$\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i$$

Here, $W(x_i, x')$ is the non-negative weight of the $i$'th training point relative to the new point $x'$ in the same tree. For any particular $x'$, the weights for points $x_i$ must sum to one. Weight functions are given as follows: in k-NN, $W(x_i, x') = 1/k$ if $x_i$ is one of the $k$ points closest to $x'$, and zero otherwise; in a tree, $W(x_i, x') = 1/k'$ if $x_i$ is one of the $k'$ points in the same leaf as $x'$, and zero otherwise.

Since a forest averages the predictions of a set of $m$ trees with individual weight functions $W_j$, its predictions are

$$\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x')\, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i$$

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of $x'$ in this interpretation are the points $x_i$ sharing the same leaf in any tree $j$.

In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set.
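The weighted neighborhood view can be made concrete with one-dimensional toy data. In this sketch each "tree" is represented only by a function mapping a point to a leaf id, a hypothetical stand-in for a fitted tree; all function names are illustrative.

```python
def knn_weights(xs, x_new, k):
    """Weight 1/k on the k training points closest to x_new, zero elsewhere."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    nearest = set(order[:k])
    return [1.0 / k if i in nearest else 0.0 for i in range(len(xs))]

def tree_weights(xs, x_new, leaf_of):
    """Weight 1/k' on the k' training points sharing a leaf with x_new."""
    leaf = leaf_of(x_new)
    same = [leaf_of(x) == leaf for x in xs]
    k_prime = sum(same)
    return [1.0 / k_prime if s else 0.0 for s in same]

def forest_weights(xs, x_new, trees):
    """Forest weights are the average of the per-tree weights."""
    per_tree = [tree_weights(xs, x_new, leaf_of) for leaf_of in trees]
    return [sum(column) / len(trees) for column in zip(*per_tree)]

def weighted_prediction(weights, ys):
    return sum(w * y for w, y in zip(weights, ys))

xs = [0.1, 0.4, 0.9, 1.2, 1.7]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
# Two hypothetical trees that cut the line into unit-width leaves at different offsets.
trees = [lambda x: int(x // 1.0), lambda x: int((x + 0.5) // 1.0)]

w_knn = knn_weights(xs, 0.8, k=2)
w_forest = forest_weights(xs, 0.8, trees)
y_hat = weighted_prediction(w_forest, ys)
```

The forest weights are nonzero for every training point that shares a leaf with the query in at least one tree, so the effective neighborhood is the union of the leaves across trees.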

Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature. As part of their construction, random forest predictors naturally lead to a dissimilarity measure among the observations.

One can also define a random forest dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution.

A random forest dissimilarity can be attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations.

The random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the 'Addcl 1' random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.

In machine learning, kernel random forests establish the connection between random forests and kernel methods.

By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.

Leo Breiman was the first person to notice the link between random forest and kernel methods. He pointed out that random forests which are grown using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin.

Lin and Jeon established the connection between random forests and adaptive nearest neighbor, implying that random forests can be seen as adaptive kernel estimates.

Davies and Ghahramani proposed Random Forest Kernel and showed that it can empirically outperform state-of-the-art kernel methods.

Scornet first defined KeRF estimates and gave the explicit link between KeRF estimates and random forest.

He also gave explicit expressions for kernels based on centered random forest and uniform random forest, two simplified models of random forest.

He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Centered forest is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute.

The algorithm stops when a fully binary tree of level $k$ is built, where $k \in \mathbb{N}$ is a parameter of the algorithm.

Uniform forest is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

Given a training sample $\mathcal{D}_n = \{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$ of $[0,1]^d \times \mathbb{R}$-valued independent random variables distributed as the independent prototype pair $(\mathbf{X}, Y)$, where $\operatorname{E}[Y^2] < \infty$, we aim at predicting the response $Y$, associated with the random variable $\mathbf{X}$, by estimating the regression function $m(\mathbf{x}) = \operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]$.

A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(\mathbf{x}, \mathbf{\Theta}_j)$ the predicted value at point $\mathbf{x}$ by the $j$-th tree, where $\mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M$ are independent random variables, distributed as a generic random variable $\mathbf{\Theta}$, independent of the sample $\mathcal{D}_n$. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction.

The trees are combined to form the finite forest estimate

$$m_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \mathbf{\Theta}_j).$$

For regression trees, we have

$$m_n(\mathbf{x}, \mathbf{\Theta}_j) = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \mathbf{\Theta}_j)}}{N_n(\mathbf{x}, \mathbf{\Theta}_j)},$$

where $A_n(\mathbf{x}, \mathbf{\Theta}_j)$ is the cell containing $\mathbf{x}$, designed with randomness $\mathbf{\Theta}_j$ and dataset $\mathcal{D}_n$, and $N_n(\mathbf{x}, \mathbf{\Theta}_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \mathbf{\Theta}_j)}$.

Thus random forest estimates satisfy, for all $\mathbf{x} \in [0,1]^d$,

$$m_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \mathbf{\Theta}_j)}}{N_n(\mathbf{x}, \mathbf{\Theta}_j)} \right).$$

Random regression forest has two levels of averaging, first over the samples in the target cell of a tree, then over all trees.

Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells.

In order to improve the random forest methods and compensate for the misestimation, Scornet defined KeRF by

$$\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \mathbf{\Theta}_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \mathbf{\Theta}_j)},$$

which is equal to the mean of the $Y_i$'s falling in the cells containing $\mathbf{x}$.

If we define the connection function of the $M$ finite forest as

$$K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \mathbf{\Theta}_j)},$$

i.e. the proportion of cells shared between $\mathbf{x}$ and $\mathbf{z}$, then almost surely we have

$$\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{x}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{x}_\ell)}.$$
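The difference between the two averages can be checked numerically on a toy example. In this sketch each tree is reduced to a function that maps a point to its cell id (a hypothetical stand-in for the partition induced by a randomized tree), an empty cell is taken to contribute 0 (a convention assumed here), and the data are one-dimensional; all names are illustrative.

```python
def forest_estimate(x, X, Y, cell_fns):
    """Average over trees of the mean response in the cell containing x."""
    per_tree = []
    for cell in cell_fns:
        ys = [y for xi, y in zip(X, Y) if cell(xi) == cell(x)]
        per_tree.append(sum(ys) / len(ys) if ys else 0.0)
    return sum(per_tree) / len(cell_fns)

def connection(x, z, cell_fns):
    """K_{M,n}(x, z): proportion of trees in which x and z share a cell."""
    return sum(cell(x) == cell(z) for cell in cell_fns) / len(cell_fns)

def kerf_estimate(x, X, Y, cell_fns):
    """Kernel form of KeRF: responses weighted by the connection function."""
    k = [connection(x, xi, cell_fns) for xi in X]
    return sum(ki * y for ki, y in zip(k, Y)) / sum(k)

X = [0.1, 0.4, 0.9, 1.2, 1.7]
Y = [1.0, 2.0, 3.0, 4.0, 5.0]
# Two hypothetical partitions of the line into unit-width cells at different offsets.
cell_fns = [lambda x: int(x // 1.0), lambda x: int((x + 0.5) // 1.0)]

rf = forest_estimate(0.8, X, Y, cell_fns)
kerf = kerf_estimate(0.8, X, Y, cell_fns)
```

Here the first cell holds three points and the second holds two. The forest estimate averages the two cell means equally, while KeRF pools all five responses, so points in the denser cell are no longer down-weighted.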

The construction of Centered KeRF of level $k$ is the same as for centered forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M)$ with the corresponding kernel function, or connection function.

Uniform KeRF is built in the same way as uniform forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_M)$ with the corresponding kernel function, or connection function.

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled, almost surely, by sequences $(a_n), (b_n)$.

When the number of trees $M$ goes to infinity, then we have infinite random forest and infinite KeRF. Their estimates are close if the number of observations in each cell is bounded, almost surely, in terms of sequences $(\varepsilon_n), (a_n), (b_n)$.

Assume that $Y = m(\mathbf{X}) + \varepsilon$, where $\varepsilon$ is a centered Gaussian noise, independent of $\mathbf{X}$, with finite variance $\sigma^2 < \infty$. Moreover, $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $m$ is Lipschitz. Scornet proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

For centered KeRF, providing $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, there exists a constant $C_1 > 0$ such that, for all $n$,

$$\mathbb{E}[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C_1 n^{-1/(3 + d \log 2)} (\log n)^2.$$

For uniform KeRF, providing $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, there exists a constant $C > 0$ such that

$$\mathbb{E}[\tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C n^{-2/(6 + 3d \log 2)} (\log n)^2.$$

The algorithm is often used in scientific works because of its advantages.

For example, it can be used for quality assessment of Wikipedia articles.
