AI News

Overfitting occurs when we build models that closely fit a training data set but fail to generalize when applied to other data sets.
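A toy illustration of this, with made-up noise data: a model that memorizes the training set scores perfectly on it, yet does no better than chance on new data.

```python
import random

random.seed(0)
# Labels are pure noise, so no model can genuinely beat 50% on new data.
train = [(i, random.choice([0, 1])) for i in range(100)]
test = [(i + 100, random.choice([0, 1])) for i in range(100)]

# A "model" that memorizes the training set and guesses elsewhere.
lookup = dict(train)

def predict(x):
    # Perfect recall on training points, a constant guess otherwise.
    return lookup.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 on train, roughly 0.5 on test
```

The training score alone looks excellent; only the held-out score reveals that nothing was learned.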

However, if your train and test data sets form an independent and identically distributed (iid) split, then it is probably better to come up with a leak-free cross-validation (CV) scheme.

Preventing leakage, that is, unexpected additional information in the training data set that lets a machine learning algorithm make unrealistically good predictions, is actually a difficult task, and a typical k-fold or random split may not always prevent it.
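As an illustration of one leak-free scheme, here is a minimal, dependency-free sketch of a group-aware K-fold split (the idea behind scikit-learn's `GroupKFold`): samples from the same group, such as a patient or a session, never appear in both train and test folds. The data and group labels below are made up.

```python
import random
from collections import defaultdict

def group_kfold(groups, n_splits=5, seed=0):
    """Split sample indices into folds so that no group appears in more
    than one fold, a guard against leakage that a plain random split
    does not provide."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    group_ids = list(by_group)
    random.Random(seed).shuffle(group_ids)
    folds = [[] for _ in range(n_splits)]
    # Greedily assign each group, largest first, to the smallest fold.
    for g in sorted(group_ids, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[g])
    return folds

# Example: samples 0-9 belong to four groups; each fold contains whole
# groups only, so group-level information cannot leak across the split.
groups = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "d"]
folds = group_kfold(groups, n_splits=2)
for test_idx in folds:
    test_groups = {groups[i] for i in test_idx}
    train_groups = {groups[i] for f in folds if f is not test_idx for i in f}
    assert test_groups.isdisjoint(train_groups)  # no shared groups
```

A plain k-fold over the same data would routinely place rows from one group on both sides of the split, which is exactly the leakage the paragraph above warns about.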

When thinking about overfitting, here are some points about the Data Science Bowl to keep in mind: am I applying a multiple-comparison correction to my score, given the large number of submissions?
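The multiple-comparisons point can be made concrete with a small back-of-the-envelope sketch (illustrative numbers only, not tied to any particular competition):

```python
def bonferroni_threshold(alpha, n_submissions):
    # Bonferroni correction: divide the significance level by the
    # number of comparisons (here, leaderboard submissions).
    return alpha / n_submissions

def family_wise_error(alpha, n_submissions):
    # Probability of at least one spurious "improvement" across
    # n independent comparisons, each tested at level alpha.
    return 1 - (1 - alpha) ** n_submissions

# With 100 submissions at a nominal 5% level, at least one lucky score
# is almost guaranteed, so the per-test level must be tightened.
print(family_wise_error(0.05, 100))    # close to 1
print(bonferroni_threshold(0.05, 100))  # 0.05 / 100
```

In other words, the more submissions you make, the more your best public score is inflated by chance alone, which is why an uncorrected leaderboard position is a poor estimate of true skill.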

As a general rule, you should expect a drop from public to private leaderboard ranking proportional to your number of submissions, unless you follow a very rigid CV scheme.

So, overfitting is not a serious risk in this case when the number of test set rows (observations) is very large, in the billions, and the number of columns (features) is far smaller than the number of rows.

Kaggle Camera Model Identification (1-2 places) — Artur Fattakhov, Ilya Kibardin, Dmitriy Abulkhanov

Artur Fattakhov, Ilya Kibardin and Dmitriy Abulkhanov share their winning solutions to the Kaggle Camera Model Identification competition. In this competition, Kagglers ...

Kaggle gold medal solution: Mercedes-Benz Greener Manufacturing — Daniel Savenkov [Eng subtitles]

Daniel Savenkov presents his solution to the Kaggle Mercedes-Benz Greener Manufacturing competition. In this competition, Daimler challenged Kagglers to tackle ...

Kaggle Cdiscount’s Image Classification Challenge — Pavel Ostyakov, Alexey Kharlamov

Pavel Ostyakov and Alexey Kharlamov share their solution to the Kaggle Cdiscount Image Classification Challenge. In this competition, Kagglers were challenged ...