
A Simple Machine Learning Method to Detect Covariate Shift

Building a predictive model that performs reasonably well when scoring new data in production is a multi-step, iterative process that requires the right mix of training data, feature engineering, machine learning, evaluations, and black art.

Once a model is running “in the wild”, its performance can degrade significantly when the distribution generating the new data varies from the distribution that generated the data used to train the model.

This problem is formally known as Covariate Shift when only the distribution of the inputs used as predictors (covariates) changes between the training and production stages, or as Dataset Shift when the joint distribution of inputs and the output (the target being predicted) changes as well.

Both Covariate Shift and Dataset Shift are receiving more attention from the research community. But in practical settings, how can you automatically detect that there’s a significant difference between training and production data, so you can take action and retrain or adjust your model accordingly?

In this post, I’m going to show you how to use Machine Learning (how could it be otherwise?) to quickly check whether there’s a covariate shift between training data and production data. You read that right: Machine Learning to learn whether machine-learned models will perform well or not.

First of all, I create a source for the training data (lines 58-61) and another for the production data (lines 63-66), using their respective remote locations defined at the beginning of the script (lines 22 and 23).

Then I create a full dataset for the training source (lines 70-73) and another for the production source (lines 76-79). Next, I sample the training data and add a new field named “Origin” with the value “train” to each instance (lines 90-98), and do the same for the production data but with the value “production” (lines 100-108).
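The sampling-and-labeling step above can be sketched roughly as follows. This is a minimal stand-in, assuming the two datasets are available as pandas DataFrames (the function name, sample size, and seed are illustrative, not from the original script):

```python
import pandas as pd

def label_origin(train_df, prod_df, sample_size=1000, seed=42):
    """Sample both datasets and tag each row with an "Origin" field,
    mirroring the "train" / "production" labeling described above."""
    train = train_df.sample(n=min(sample_size, len(train_df)),
                            random_state=seed).copy()
    prod = prod_df.sample(n=min(sample_size, len(prod_df)),
                          random_state=seed).copy()
    train["Origin"] = "train"
    prod["Origin"] = "production"
    # One combined dataset, ready to train an origin classifier on.
    return pd.concat([train, prod], ignore_index=True)
```

The “Origin” field then becomes the target of the classifier: everything else about the instances stays untouched.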

I also specify a seed to make sure that I can later evaluate against a completely disjoint subset of the data. Once the model is created, I’m ready to create an evaluation of the new model with the portion of the dataset that I didn’t use to create it (“out_of_bag”: true).
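The seed-and-holdout idea can be illustrated with a small sketch (the function is hypothetical, not the tooling used in the post): fixing the random seed makes the sample reproducible, so the complement of the sampled rows, the “out of bag” portion, is guaranteed to be a disjoint evaluation set.

```python
import numpy as np

def split_with_seed(n_rows, sample_rate=0.8, seed=42):
    """Return disjoint (in_bag, out_of_bag) row indices.
    The same seed always yields the same split, so the out-of-bag
    rows never overlap the rows used to build the model."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(n_rows * sample_rate)
    return idx[:cut], idx[cut:]
```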

To quickly check whether there’s a covariate shift between your training and production data, you can create a predictive model using a mix of instances from both. If the model is capable of telling training instances apart from production instances, then you can say that there’s a covariate shift; if its performance is no better than random guessing, the two distributions are likely similar.
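The whole check can be sketched end to end as below. This is a minimal self-contained version using scikit-learn as a stand-in for the tooling in the post, assuming numeric feature matrices; the function name and the 0.5-vs-1.0 AUC reading are mine, not the author’s:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def covariate_shift_score(X_train, X_prod, seed=42):
    """Train a classifier to tell training rows from production rows.
    Returns the held-out AUC: a value near 0.5 means the classifier
    cannot separate the two samples (no detectable shift), while a
    value near 1.0 indicates a strong covariate shift."""
    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    # Hold out a disjoint evaluation portion, as in the workflow above.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

If the score comes back high, that is the signal to investigate which features drifted and to retrain or adjust the production model.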
