# AI News, Consistency in DataScience

## Consistency in DataScience

I used Kaggle competitions to measure internal validity in data science.

Kaggle has a handy rule for detecting overfitting: Kaggle competitions are decided by your model’s performance on a test data set.

Your Public score is what you receive back upon each submission (that score is calculated using a statistical evaluation metric, which is always described on the Evaluation page).

When the competition ends, we take your selected submissions (see below) and score your predictions against the REMAINING FRACTION of the test set, or the private portion.

Perfectly consistent solutions would have similar scores on both public (horizontal axis) and private (vertical axis) leaderboards.

Points that fall away from the diagonal indicate solutions that don’t generalize well to new data: their predictive power declines.

When a data scientist gets a high score by luck, they won’t retain that position on the private leaderboard.
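One rough way to put a number on this consistency is a rank correlation between public and private scores. The sketch below is an illustration only: the score lists are made up, the function names are hypothetical, and the formula assumes no tied scores.

```python
def ranks(scores):
    """Rank position of each score (0 = lowest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, i in enumerate(order):
        result[i] = position
    return result


def spearman(public, private):
    """Spearman rank correlation between two score lists (no ties)."""
    n = len(public)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(public), ranks(private)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up scores for five teams, in the same team order in both lists.
public = [0.91, 0.88, 0.86, 0.80, 0.75]
private = [0.84, 0.87, 0.79, 0.81, 0.74]
rho = spearman(public, private)
print(f"rank correlation: {rho:.2f}")
```

A value near 1 means the public ordering carried over to the private leaderboard; a value near 0 means the public ranking said little about final standing.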

One is “restaurant revenue prediction”: predicting revenues for restaurants given geography and demographics.

Data analysis can’t help here until the company gets the data on thousands of other restaurants.

Another competition, MIT’s course homework on predicting successful blog posts, also suffers from too many factors affecting the outcome.

Algorithmic trading relies on handpicked cases with unique data models, and the competition offers just the opposite.

## How to Lead a Data Science Contest without Reading the Data

By Moritz Hardt

Machine learning competitions have become an extremely popular format for solving prediction and classification problems of all sorts.

We will see that in Kaggle’s famous Heritage Health Prize competition this might have propelled a participant from rank around 150 into the top 10 on the public leaderboard without making progress on the actual problem.

The point of this post is to illustrate why maintaining a leaderboard that accurately reflects the true performance of each team is a difficult and deep problem.

While there are decades of work on estimating the true performance of a model (or set of models) from a finite sample, the leaderboard application highlights some challenges that, while fundamental, have only recently seen increased attention.

A follow-up post will describe a recent paper with Avrim Blum that gives an algorithm for maintaining a (provably) accurate public leaderboard.

Predicting these missing class labels is the goal of the participant and a valid submission is a list of labels—one for each point in the holdout set.

Kaggle specifies a score function that maps a submission consisting of N labels to a numerical score, which we assume to be in [0,1].

That is, a prediction incurs loss 0 if it matches the corresponding unknown label and loss 1 if it does not match it.

The public leaderboard is a sorting of all teams according to their score computed only on the $$n$$ holdout labels (without using the test labels), while the private leaderboard is the ranking induced by the test labels.

### The cautionary tale of wacky boosting

Imagine your humble blogger in a parallel universe: I’m new to this whole machine learning craze.

Slightly more formally, here’s what I do: Algorithm (Wacky Boosting): Lo and behold, this is what happens: As I’m only seeing the public score (bottom red line), I get super excited.

This introduces a bias in the score and the conditional expected bias of each selected vector $$w_i$$ is roughly $$1/2-c/\sqrt{n}$$ for some positive constant $$c>0$$.

Put differently, each selected $$y_i$$ is giving us a guess about each label in the unknown holdout set $$H\subseteq [N]$$ that’s correct with probability $$1/2 + \Omega(1/\sqrt{n})$$.
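The step from this per-vector advantage to the overall bias can be sketched as follows. This is a back-of-the-envelope argument, not a proof. It assumes that the selected vectors are independent, that about half of the $$k$$ random vectors are selected (call that number $$m$$, a symbol introduced here for illustration), and that $$k$$ is at most on the order of $$n$$.

$$\Pr[\text{majority vote matches a holdout label}] = \frac{1}{2} + \Omega\left(\frac{\sqrt{m}}{\sqrt{n}}\right) = \frac{1}{2} + \Omega\left(\sqrt{\frac{k}{n}}\right)$$

The public score of a single random submission has standard deviation

$$\sqrt{\frac{1}{4n}} = \frac{1}{2\sqrt{n}},$$

so the advantage of the majority vote, measured in these units, is

$$\frac{\Omega\left(\sqrt{k/n}\right)}{1/(2\sqrt{n})} = \Omega\left(\sqrt{k}\right).$$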

To summarize, wacky boosting gives us a bias of $$\sqrt{k}$$ standard deviations on the public score with $$k$$ submissions.
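The behaviour described above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the post's own code: labels are uniformly random bits, a random submission is kept when its public loss is below one half, and the kept vectors are combined by a coordinate-wise majority vote. The sizes (`N`, `n`, 500 submissions) and all function names are illustrative.

```python
import random


def loss(pred, labels, idx):
    """0/1 loss: fraction of positions in idx where pred and labels differ."""
    return sum(pred[i] != labels[i] for i in idx) / len(idx)


def wacky_boosting(public_loss, N, k, rng):
    """Submit k random label vectors, keep those with public loss < 1/2,
    and return their coordinate-wise majority vote (ties go to 0)."""
    selected = []
    for _ in range(k):
        y = rng.choices([0, 1], k=N)
        if public_loss(y) < 0.5:
            selected.append(y)
    if not selected:  # nothing beat one half: fall back to all zeros
        return [0] * N
    return [int(2 * sum(column) > len(selected)) for column in zip(*selected)]


rng = random.Random(0)
N, n = 2000, 1000                  # N labels in total, n of them in the holdout
labels = rng.choices([0, 1], k=N)  # unknown to the participant
holdout = range(n)                 # scored on the public leaderboard
test = range(n, N)                 # scored on the private leaderboard

y_hat = wacky_boosting(lambda y: loss(y, labels, holdout), N, 500, rng)
public = loss(y_hat, labels, holdout)
private = loss(y_hat, labels, test)
print(f"public loss: {public:.3f}, private loss: {private:.3f}")
```

The public loss should come out clearly below one half while the private loss stays close to one half: the submission has learned something about the holdout labels and nothing about the problem.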

### Why the holdout method breaks down

The idea behind the holdout method is that the holdout data serve as a fresh sample providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution.

One point of departure from the classic method is that the participants actually do see the data points corresponding to holdout labels which can lead to some problems.

But that’s not the issue here, and even if we don’t look at the holdout data points at all, there’s a fundamental reason why the validity of the classic holdout method breaks down.

The problem is that a submission in general incorporates information about the holdout labels previously released through the leaderboard mechanism.

### Static vs interactive data analysis

Kaggle’s liberal use of the holdout method is just one example of a widespread disconnect between the theory of static data analysis and the practice of interactive data analysis.

Unfortunately, most of the theory on model validation and statistical estimation falls into the static setting requiring independence between method and holdout data.

Nevertheless, we can build a reasonable model of what the holdout labels might look like using information that was released by Kaggle, and see how well we’d be doing against our random model.

What was really happening in the algorithm is that we had two candidate solutions, the all ones vector and the all zeros vector, and we tried out random coordinate-wise combinations of these vectors.

The algorithm ends up finding a coordinate-wise combination of the two vectors that improves upon their mean loss, i.e., one half.

I chose the Heritage Health Prize because it was the Kaggle competition with the largest prize ever (3 million dollars) and it ran for two years with a substantial number of submissions.

## The Dangers of Overfitting or How to Drop 50 spots in 1 minute

This post was originally published on Gregory Park's blog. Reprinted with permission from the author (thanks Gregory!)

Over the last month and a half, the Online Privacy Foundation hosted a Kaggle competition, in which competitors attempted to predict psychopathy scores based on abstracted Twitter activity from a couple thousand users.

One of the goals of the competition is to determine how much information about one’s personality can be extracted from Twitter, and by hosting the competition on Kaggle, the Online Privacy Foundation can sit back and watch competitors squeeze every bit of predictive ability out of the data, trying to predict the psychopathy scores of 1,172 Twitter users.

Competitors can submit two sets of predictions each day, and each submission is scored from 0 (worst) to 1 (best) using a metric known as “average precision”.

However, the public leaderboard score isn’t actually the “true” score – it is only an estimate based on a small portion of the submission.

When the competition ends, five submissions from each competitor are compared to the full set of test data (all 1,172 Twitter accounts), and the highest scoring submission from each user is used to calculate the final score.

I wasn’t the only one who took a big fall: the top five users on the public leaderboard ended up in 64th, 52nd, 58th, 16th, and 57th on the private leaderboard, respectively.

Below, I’ve plotted the number of entries from each user against their final standing on the public and private leaderboards and added a trend line to each plot.

The problem is that submissions that score well using this approach probably will not generalize to the full set of test data when the competition closes.

Because the public leaderboard is only based on a small portion of the test data, it is only a rough estimate of the true quality of a submission, and cross-validation gives a sort of second opinion of a submission’s quality.

It turned out that my cross-validation estimates were not related to the private scores at all (notice the horizontal linear trends in those scatterplots), and the public leaderboard wasn’t any better.
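For readers unfamiliar with the idea, k-fold cross-validation can be sketched in a few lines. This is an illustration in plain Python, not the author's R code: the model is a hypothetical stand-in that predicts the training mean, and mean absolute error replaces the competition's average precision metric.

```python
import random


def fit_mean(X, y):
    """Stand-in model: always predicts the mean of the training targets."""
    mean = sum(y) / len(y)
    return lambda rows: [mean] * len(rows)


def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_val_score(fit, metric, X, y, n_folds=5, seed=0):
    """Hold out each fold once, train on the rest, score the held-out fold."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(n_folds):
        held_out = order[f::n_folds]
        held_set = set(held_out)
        train = [i for i in order if i not in held_set]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict([X[i] for i in held_out])
        scores.append(metric([y[i] for i in held_out], preds))
    return sum(scores) / n_folds, scores


# Made-up data: one random feature and a noisy target.
rng = random.Random(1)
X = [[rng.random()] for _ in range(50)]
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
cv_mean, fold_scores = cross_val_score(fit_mean, mean_abs_error, X, y)
print(f"CV estimate: {cv_mean:.3f} over {len(fold_scores)} folds")
```

The spread of `fold_scores` gives a feel for how noisy the estimate is, which matters when the public leaderboard is computed on a small sample.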

Here are some things I’ll take into the next contest: One of my best submissions (average precision = .86294) was actually one of my own benchmarks that took very little thought.

After imputing missing values in the training and test set with medians, I used the gbm package in R to fit a boosting model using every column in the data as a predictor.

## How to Rank 10% in Your First Kaggle Competition

Kaggle is the best place to learn from other data scientists.

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

In many competitions public LB scores are not very consistent with local CV scores due to noise or non-i.i.d. data.

You can use test results to roughly set a threshold for determining whether an increase of score is due to genuine improvement or randomness.

Some common steps are: How we choose to perform preprocessing largely depends on what we learn about the data in the previous stage.

In practice, I recommend using Jupyter Notebook for data manipulation and mastering usage of frequently used Pandas operations.

For a categorical variable with n possible values, we create a group of n dummy variables.

Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.
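The dummy-variable encoding just described can be sketched in plain Python. The function name and the example values are illustrative only.

```python
def one_hot(values):
    """Return the sorted categories and one row of 0/1 dummies per value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories]
            for value in values]
    return categories, rows


colors = ["red", "blue", "red", "green"]
categories, rows = one_hot(colors)
for value, row in zip(colors, rows):
    print(value, row)
```

Each row contains exactly one 1, in the column of the value that record takes. In practice `pandas.get_dummies` does the same job on a DataFrame column.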

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning.

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features.

Yet there’s still something to gain from feature selection beforehand: The simplest way to inspect feature importance is by fitting a random forest model.

You can combat noisy data (to an extent) simply by increasing number of trees used in a random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.

Kaggle competitions usually favor tree-based models: The following models are slightly worse in terms of general performance, but are suitable as base models in ensemble learning (will be discussed later): Note that this does not apply to computer vision competitions which are pretty much dominated by neural network models.

For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features used in developing each tree.

We need to understand how models work and what impact each parameter has on the model’s performance, be it accuracy, robustness or speed.

By the way, random forests usually reach their optimum when max_features is set to the square root of the total number of features.

These parameters are generally considered to have real impacts on its performance: Usual tuning steps: Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed.

It reduces both bias and variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting.

Common approaches of ensemble learning are: In theory, for the ensemble to perform well, two factors matter: Actually we have a trade-off here.

This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data.

After 5 iterations we will obtain a matrix of shape #(samples in training data) X #(base models).

After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.
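The bookkeeping in this procedure can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: the two base models (a mean predictor and a one-feature least-squares line) and all names are hypothetical stand-ins, and the stacker itself is left out. The point is how the two input matrices for the stacker are built.

```python
import random


def fit_mean(X, y):
    """Base model 1: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda rows: [mean for _ in rows]


def fit_line(X, y):
    """Base model 2: least-squares line on the first feature."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (t - my) for x, t in zip(xs, y))
    slope = cov / var if var else 0.0
    return lambda rows: [my + slope * (row[0] - mx) for row in rows]


def stack_features(fitters, X_train, y_train, X_test, n_folds=5, seed=0):
    """Build out-of-fold train predictions and fold-averaged test predictions."""
    n = len(X_train)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    oof = [[0.0] * len(fitters) for _ in range(n)]           # n_train x n_models
    test_pred = [[0.0] * len(fitters) for _ in X_test]       # n_test x n_models
    for f in range(n_folds):
        held_out = order[f::n_folds]
        held_set = set(held_out)
        train = [i for i in order if i not in held_set]
        X_fit = [X_train[i] for i in train]
        y_fit = [y_train[i] for i in train]
        for j, fit in enumerate(fitters):
            predict = fit(X_fit, y_fit)
            # Predictions on the fold the model has not seen.
            for i, p in zip(held_out, predict([X_train[i] for i in held_out])):
                oof[i][j] = p
            # Each fold's model also predicts the test set; average the folds.
            for t, p in enumerate(predict(X_test)):
                test_pred[t][j] += p / n_folds
    return oof, test_pred


X_train = [[float(i)] for i in range(20)]
y_train = [2.0 * i + 1.0 for i in range(20)]
X_test = [[20.0], [21.0]]
oof, test_pred = stack_features([fit_mean, fit_line], X_train, y_train, X_test)
print(len(oof), len(oof[0]), len(test_pred), len(test_pred[0]))
```

Any model can then be fitted on `oof` against `y_train` as the stacker and applied to `test_pred` to produce the final predictions.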

The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color.

I’ll only give a brief summary here: Note that features listed above with * are the last batch of features I added.

As a matter of fact, most top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success.

I thought of including linear regression, SVM regression and XGBRegressor with linear booster into the ensemble, but these models had RMSE scores that are 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used.

During the last two days of the competition, I did one more thing: use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission.

It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker.

Making multiple runs with different seeds ensures that a different 80% of the data is used each time, thus reducing the risk of information leakage.
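Averaging over seeds can be sketched as follows. `run_pipeline` is a hypothetical stand-in for a full train-and-predict run of the ensemble with a given seed (here it just returns noisy numbers), and the equal weights are an assumption, since the weights actually used are not given.

```python
import random


def run_pipeline(seed, n_rows=4):
    """Hypothetical stand-in: one full ensemble run, returning test predictions."""
    rng = random.Random(seed)
    return [2.5 + rng.gauss(0.0, 0.1) for _ in range(n_rows)]


def weighted_average(runs, weights):
    """Row-by-row weighted average of several prediction lists."""
    total = sum(weights)
    return [sum(w * run[i] for run, w in zip(runs, weights)) / total
            for i in range(len(runs[0]))]


runs = [run_pipeline(seed) for seed in range(20)]
final = weighted_average(runs, [1.0] * len(runs))
print([round(p, 3) for p in final])
```

Each run sees a different split of the data, so the averaged prediction depends less on any single split.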

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849.
