AI News, Open Machine Learning Course. Topic 10. Gradient Boosting

Open Machine Learning Course. Topic 10. Gradient Boosting

Many data scientists include this algorithm in their data scientist’s toolbox because of the good results it yields on any given (unknown) problem.

Boosting was born out of the question: is it possible to get one strong model from a large amount of relatively weak and simple models? By saying “weak models”, we do not mean simple basic models like decision trees but models with poor accuracy performance, where poor is a little bit better than random.

positive mathematical answer to this question was identified, but it took a few years to develop fully functioning algorithms based on this solution e.g.

These algoritms take a greedy approach: first, they build a linear combination of simple models (basic algorithms) by re-weighing the input data.

Then, the model (usually a decision tree) is built on earlier incorrectly predicted objects, which are now given larger weights.

Let’s have a look at the following toy classification problem where we are going to split the data between the trees of depth 1 (also known as ‘stumps’) on each iteration of AdaBoost.

Although, if we take a weighted vote for the stumps, we will get the correct classifications: Here is a more detailed example of AdaBoost where, as we iterate, we can see the weights increase, especially on the border between classes.

Although, there is a very interesting interview with him about the creation of CART and how they solved statistics problems (which is similar to data analysis and data science today) more than 40 years ago.

From a mathematical perspective, this is not a big change — we are still adding (or boosting) weak algorithms and enlarging our ensemble with gradual improvements for parts of the data where the model was inaccurate.

This problem was rewritten in terms of a loss function that penalizes errors in the output order, so it became convenient to simply insert it into GBM.

With Kaggle, one could test new algorithms on the real data, giving algoritms oppurtunity to “shine”, and provide full information in sharing model performance results across competition data sets.

This algorithm has gone through very typical path for ML algorithms today: mathematical problem and algorithmic crafts to successful practical applications and mass adoption years after its first appearance.

We restore the dependence by approximating f(x) and by understanding which approximation is better when we use the loss function L(y,f) , which we want to minimize: At this moment, we do not make any assumptions regarding the type of dependence f(x), the model of our approximation, or the distribution of the target variable.

Additionally, let’s write out our approximation for a number of M iterations as a sum: Then, the only thing left is to find a suitable, iterative algorithm to minimize the last expression.

Let’s review the steps for this inefficient and naive algorithm: Let’s imagine for a second that we can perform optimization in the function space and iteratively search for the approximations as functions themselves.

we have only decided that we will search for our approximation not as a big model with plenty of parameters (as an example, neural network), but as a sum of functions, pretending we move in functional space.

The constant value, as well as the optimal coefficient, are identified via binary search or another line search algorithm over the initial loss function (not a gradient).

In this toy example, we will restore a noisy function This is a regression problem with a real-valued target, so we will choose to use the mean squared error loss function.

Let’s put together everything we need to use GBM: We will run GBM and draw two types of graphs: the current approximation (blue graph) and every tree built on its pseudo-residuals (green graph).

This was due to the fact that our trees simply did not have enough depth to build a symmetrical branch at once, and it focused on the left branch with the larger error.

To play with GBM function approximations, you can use the awesome interactive demo in this blog called Brilliantly wrong: If we want to solve a classification problem instead of regression, what would change?

The results are the following: The overall results of GBM with quantile loss function are the same as the results with quadratic loss function offset by ~ 0.135.

You can see how the “lower” areas are separating because the trees are more confident in the correct prediction of the negative class and how the two steps of mixed classes are forming.

for churn prediction, it is more useful to predict the churn of clients with high LTV (or lifetime value: how much money a client will bring in the future).

The statistical warrior would invent their own loss function, write out the gradient for it (for more effective training, include the Hessian), and carefully check whether this function satisfies the required properties.

In general, if we know that some subset of data, both in the input variables and in the target variable, has greater importance for our model, then we just assign them a larger weight w(x,y).

The main goal is to fulfill the general requirements for weights: Weights can significantly reduce the time spent adjusting the loss function for the task we are solving and also encourages experiments with the target models’ properties.

We will define a strongly asymmetric weight function as follows: With these weights, we expect to get two properties: less detailing for negative values of X and the form of the function, similar to the initial cosine.

Third, the function that we got on the third iteration received enough attention and started looking similar to the original cosine (also started to slightly overfit).

In addition, this methodology is sufficiently flexible and expandable — it is possible to train a large number of models, taking into consideration different loss-functions with a variety of weighting functions.

Practice and ML competitions show that, in standard problems (except for image, audio, and very sparse data), GBM is often the most effective algorithm (not to mention stacking and high-level ensembles, where GBM is almost always a part of them).

If we used 30 trees instead of 3 and trained the GBM as described, the result would not be that predictable: Full versions of assignments are announced each week in a new run of the course (October 1, 2018).

Gradient boosting ensemble technique for regression.

Kaggle gold medal solution: Mercedes-Benz Greener Manufacturing — Daniel Savenkov [Eng subtitles]

Daniel Savenkov tells his solution of Kaggle Mercedes-Benz Greener Manufacturing competition. In this competition, Daimler is challenging Kagglers to tackle ...

Bioconductor Workshop 1: R/Bioconductor Workshop for Genomic Data Analysis

The Computational Biology Core (CBC) at Brown University (supported by the COBRE Center for Computational Biology of Human Disease) and ...

Integrate Analysis and Interactive Exploration of Data from TCGA - Ilya Shmulevich

November 17-18, 2011 - The Cancer Genome Atlas' 1st Annual Scientific Symposium More:

Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares results in estimates that ...

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the Gödel Prize in ...

The Future of Genomic Medicine - Anne Wojcicki, Richard Lifton and Eric Green

September 30, 2014 - Genomics and Our Health: What does the future hold? A closing symposium for the exhibition - Genome: Unlocking Life's Code - that ...

Using Networks to Link Genotype to Phenotype

One of the central tenets of biology is that our genetics—our genotype—influences the physical characteristics we manifest—our phenotype. Dr. Quackenbush's ...