Covariate Shift - Challenges and Good Practice

Sydney, Sep 19th 11:25 - 11:55 AM, Grand Lodge

A fundamental assumption in supervised machine learning is that the training and query data are drawn from the same population or distribution. In real-life applications this is very often not the case, because the query data distribution is unknown and cannot be guaranteed a priori. Selection bias in collecting training samples, for example, makes the distribution of the training data differ from that of the overall population. This problem is known as covariate shift in the machine learning literature, and using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.

Covariate shift is only detectable when we have access to query data. Visualizing the training and query data is a helpful way to gain an initial impression. Machine learning models can also be used to detect covariate shift. For example, a Gaussian process can model how similar each query point is to the training data in feature space, and a one-class SVM can flag query points that are outliers with respect to the training data. Both strategies detect query points that lie in a different region of the feature space from the training dataset.
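The Gaussian process idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the speakers' implementation: it uses an RBF kernel with unit prior variance, a fixed length scale and noise level, and a hypothetical threshold of 0.5 on the predictive variance. The function names (`rbf_kernel`, `gp_predictive_variance`, `flag_shifted`) are made up for this example. With fixed hyperparameters the GP predictive variance depends only on the inputs, so it grows for query points that lie far from all training points.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_predictive_variance(x_train, x_query, length_scale=1.0, noise=1e-2):
    """GP predictive variance of the latent function at each query point.

    Assumes a zero-mean GP with unit prior variance. The result is close to
    the noise level near training data and close to 1 far away from it.
    """
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_tt += noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query, length_scale)
    solved = np.linalg.solve(k_tt, k_tq)
    return 1.0 - np.sum(k_tq * solved, axis=0)


def flag_shifted(x_train, x_query, threshold=0.5):
    """Boolean mask of query points whose variance exceeds the threshold.

    The threshold is an illustrative assumption and needs tuning in practice.
    """
    return gp_predictive_variance(x_train, x_query) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_train = rng.normal(0.0, 1.0, size=(200, 2))
    x_query = rng.normal(4.0, 1.0, size=(100, 2))  # deliberately shifted
    print("fraction flagged:", flag_shifted(x_train, x_query).mean())
```

A one-class SVM could play the same role by scoring each query point against a boundary fitted on the training features only.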

We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.

First, re-weighting the training data is the process of matching distribution statistics between the training and query sets in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to query data. However, this only works when the training and query datasets overlap significantly in feature space: training points cannot be up-weighted in regions where there are none.
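One common way to obtain such weights is to estimate the density ratio between query and training data with a domain classifier. The sketch below is an assumption-laden illustration rather than the method presented in the talk: it fits a plain logistic regression by gradient descent to separate training points (label 0) from query points (label 1), and converts the classifier odds into importance weights. The helper names `fit_logistic` and `importance_weights`, the learning rate, and the iteration count are all hypothetical choices.

```python
import numpy as np


def fit_logistic(x, y, lr=0.5, n_iter=2000):
    """Fit logistic regression (with a bias term) by batch gradient descent."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    theta = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(xb @ theta)))
        theta -= lr * xb.T @ (p - y) / len(y)
    return theta


def importance_weights(x_train, x_query):
    """Estimate w(x) = p_query(x) / p_train(x) for every training point.

    The odds of the domain classifier, corrected for the two sample sizes,
    approximate the density ratio. Weights are normalised to have mean 1.
    """
    x = np.vstack([x_train, x_query])
    y = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_query))])
    theta = fit_logistic(x, y)
    logit = np.hstack([x_train, np.ones((len(x_train), 1))]) @ theta
    w = np.exp(logit) * len(x_train) / len(x_query)
    return w / w.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_train = rng.normal(0.0, 1.0, size=(1000, 1))
    x_query = rng.normal(1.0, 1.0, size=(1000, 1))  # shifted mean
    w = importance_weights(x_train, x_query)
    print("unweighted training mean:", x_train.mean())
    print("re-weighted training mean:", np.average(x_train[:, 0], weights=w))
    print("query mean:", x_query.mean())
```

The resulting weights would typically be passed as per-sample weights to the loss during both training and validation. When the two distributions barely overlap, a few training points receive very large weights and the estimate becomes unreliable.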

Second, there may be situations where we can acquire the labels of a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they quantify the uncertainty in their predictions. Active learning then lets us select small subsets of query points to label, chosen so as to shrink the uncertainty in our overall prediction as much as possible.
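A simple version of this selection step can be sketched with a Gaussian process. The code below is an illustrative heuristic, not necessarily the selection criterion used in the talk: it greedily picks the query point with the largest GP predictive variance, treats it as observed, and repeats until a labelling budget is spent. The kernel, the noise level, and the names `predictive_variance` and `select_queries` are assumptions made for this example. Because the GP variance with fixed hyperparameters does not depend on the labels, the whole batch can be chosen before any label is acquired.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def predictive_variance(x_obs, x_query, noise=1e-2):
    """GP predictive variance at x_query given inputs observed at x_obs."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    return 1.0 - np.sum(k_oq * np.linalg.solve(k_oo, k_oq), axis=0)


def select_queries(x_train, x_query, budget):
    """Greedily choose `budget` query indices with maximal predictive variance."""
    x_obs = x_train.copy()
    available = np.ones(len(x_query), dtype=bool)
    chosen = []
    for _ in range(budget):
        var = predictive_variance(x_obs, x_query)
        var = np.where(available, var, -np.inf)  # never pick a point twice
        idx = int(np.argmax(var))
        chosen.append(idx)
        available[idx] = False
        # Treat the chosen point as labelled for the next round.
        x_obs = np.vstack([x_obs, x_query[idx]])
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x_train = rng.normal(0.0, 1.0, size=(50, 2))
    x_query = rng.normal(3.0, 1.0, size=(80, 2))
    picked = select_queries(x_train, x_query, budget=5)
    print("query indices to label:", picked)
```

After labels for the chosen points are acquired, the model would be refit on the training data plus the newly labelled query points.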


Outline/Structure of the Talk

  • Outline the issue of covariate shift and how it can lead to pathological prediction problems
  • How to detect covariate shift when we have query data on hand
  • Detection of covariate shift in live production
  • Regression/classification methods that are robust to covariate shift
  • Two strategies for handling covariate shift
  • First strategy: Re-weighting training data (for training and validation)
  • Second strategy: Active learning with probabilistic models

Learning Outcome

  • Gain an understanding of covariate shift and its effects in production
  • Learn methods for detecting covariate shift in a number of situations you are likely to encounter in production
  • Learn how to effectively deal with covariate shift

Target Audience

Data scientists and machine learning practitioners



Submitted 4 years ago

  • Natalia Ruemmele - Cast a Net Over your Data Lake

    Natalia Ruemmele
    Data Scientist
    Data61, CSIRO
    4 years ago
    30 Mins

    As the variety of data continues to expand, the need for different kinds of analytics is increasing – big data is no longer just about the volume, but also about its increasing diversity. Unfortunately, there is no one-size-fits-all approach to analytics – no magic pill that will get your organization the insight it needs from data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious inter-connections among your data sources.

    In this talk we will discuss some use cases for graph analytics and walk through a particular scenario to find power-users for a promotion campaign. We will also cover machine learning approaches which can assist you in constructing graphs from diverse data sources.