# AI News: Competing in a data science contest without reading the data

- On Tuesday, March 6, 2018

## Competing in a data science contest without reading the data

Machine learning competitions have become an extremely popular format for solving prediction problems.

We will see that in Kaggle’s famous Heritage Health Prize competition this might have propelled a participant from rank around 150 into the top 10 on the public leaderboard without making progress on the actual problem.

The point of this post is to illustrate why maintaining a leaderboard that accurately reflects the true performance of each team is a difficult and deep problem.

While there are decades of work on estimating the true performance of a model (or set of models) from a finite sample, the leaderboard application highlights some challenges.

A follow-up post will describe a recent paper with Avrim Blum that gives an algorithm for maintaining a (provably) accurate public leaderboard.

Predicting these missing class labels is the goal of the participant and a valid submission is a list of labels—one for each point in the holdout set.

Kaggle specifies a score function that maps a submission consisting of N labels to a numerical score, which we assume to be in [0,1].

That is, a prediction incurs loss 0 if it matches the corresponding unknown label and loss 1 if it does not.

The public leaderboard is a sorting of all teams according to their score computed only on the \(n\) holdout labels (without using the test labels), while the private leaderboard is the ranking induced by the test labels.

I will let \(s_H(y)\) denote the public score of a submission \(y\), i.e., the score according to the public leaderboard.
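As a concrete sketch of this setup (the function and argument names are my own, not Kaggle's), the public score \(s_H(y)\) of a 0/1 submission under the loss just described is simply the average mismatch on the holdout labels:

```python
import numpy as np

def public_score(y, holdout_labels, holdout_idx):
    """Average 0/1 loss of submission y on the n secret holdout labels.

    y              -- length-N array of 0/1 predictions (the full submission)
    holdout_labels -- the n hidden labels backing the public leaderboard
    holdout_idx    -- positions of the holdout points within [N]
    Returns a score in [0, 1]; 0 means every holdout label was matched.
    """
    return float(np.mean(np.asarray(y)[holdout_idx] != holdout_labels))
```

The private score would be computed the same way, just over the disjoint set of test indices.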

Slightly more formally, here’s what I do.

Algorithm (Wacky Boosting):

1. Choose \(y_1,\dots,y_k \in \{0,1\}^N\) uniformly at random.
2. Let \(I = \{\, i : s_H(y_i) < 1/2 \,\}\).
3. Output \(\hat{y}\) whose \(j\)-th coordinate is the majority vote of \(\{\, y_i : i \in I \,\}\).

Lo and behold, this is what happens: as I’m only seeing the public score (bottom red line), I get super excited.

This introduces a bias in the score: the conditional expected score of each selected vector \(y_i\) is roughly \(1/2 - c/\sqrt{n}\) for some positive constant \(c>0\).

Put differently, each selected \(y_i\) is giving us a guess about each label in the unknown holdout set \(H\subseteq [N]\) that’s correct with probability \(1/2 + \Omega(1/\sqrt{n})\).

To summarize, wacky boosting gives us a bias of \(\sqrt{k}\) standard deviations on the public score with \(k\) submissions.
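To see the effect numerically, here is a small self-contained simulation of this scheme (the labels, sizes, and seed are my own illustrative choices; for simplicity the entire label set plays the role of the holdout):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4000                                  # holdout size
k = 400                                   # number of random submissions
true_labels = rng.integers(0, 2, size=n)  # secret holdout labels

def score(y):
    # public score: average 0/1 loss on the holdout labels
    return float(np.mean(y != true_labels))

# submit k uniformly random 0/1 vectors and record their public scores
ys = rng.integers(0, 2, size=(k, n))
scores = (ys != true_labels).mean(axis=1)

# keep the (lucky) submissions that beat chance on the public leaderboard...
selected = ys[scores < 0.5]
# ...and combine them by a coordinate-wise majority vote
y_hat = (selected.mean(axis=0) > 0.5).astype(int)

print(score(y_hat))  # noticeably below 0.5 on the public leaderboard

# against fresh labels (think: the private leaderboard) there is no gain
fresh = rng.integers(0, 2, size=n)
print(float(np.mean(y_hat != fresh)))  # stays near 0.5
```

The gap between the two printed numbers is exactly the leaderboard overfitting being described: the public score improves while performance on fresh labels does not move.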

The idea behind the holdout method is that the holdout data serve as a fresh sample providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution.

One point of departure from the classic method is that the participants actually do see the data points corresponding to holdout labels which can lead to some problems.

But that’s not the issue here, and even if we don’t look at the holdout data points at all, there’s a fundamental reason why the validity of the classic holdout method breaks down.

The problem is that a submission in general incorporates information about the holdout labels previously released through the leaderboard mechanism.

The primary way Kaggle deals with this problem is by limiting the rate of re-submission and (to some extent) the bit precision of the answers.

Kaggle’s liberal use of the holdout method is just one example of a widespread disconnect between the theory of static data analysis and the practice of interactive data analysis.

Unfortunately, most of the theory on model validation and statistical estimation falls into the static setting requiring independence between method and holdout data.

Nevertheless, we can make some reasonable modeling of what the holdout labels might look like using information that was released by Kaggle and see how well we’d be doing against our random model.

What was really happening in the algorithm is that we had two candidate solutions, the all ones vector and the all zeros vector, and we tried out random coordinate-wise combinations of these vectors.

The algorithm ends up finding a coordinate-wise combination of the two vectors that improves upon their mean loss, i.e., one half.
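A tiny sketch of this two-candidate view (all names and parameters here are illustrative, and selection by "keep the best" is one simple stand-in for the leaderboard feedback): with random labels both constant vectors have expected loss one half, yet keeping the best of many random coordinate-wise combinations already dips below it.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2000
labels = rng.integers(0, 2, size=n)  # stand-in for the holdout labels

def loss(y):
    return float(np.mean(y != labels))

# the two candidate solutions: each has expected loss 1/2 on random labels
all_ones = np.ones(n, dtype=int)
all_zeros = np.zeros(n, dtype=int)

# a random coordinate-wise combination of the two is just a uniform 0/1 vector
k = 200
combos = rng.integers(0, 2, size=(k, n))

# the leaderboard lets us keep whichever combination scored best
best = min(combos, key=loss)
print(loss(best))  # below the candidates' mean loss of one half
```

The improvement is pure selection bias: nothing about the labels was learned, we only kept the luckiest of \(k\) coin-flip vectors.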

I chose the Heritage Health Prize because it was the highest-prize Kaggle competition ever ($3 million) and it ran for two years with a substantial number of submissions.
