Image Classification in a Noisy Fraudulent World - A Journey of Computational and Statistical Performance
Formbay's fraud detection system relies on classification of photographic evidence to verify solar installations. Over the last 10 years, Formbay has amassed over 10 million labelled images of solar installations. Image classification over Formbay's dataset sounds easy: lots of data, apply neural networks and profit from automation! However, with such a large dataset there is room for a lot of noise, such as mislabelled images, overlapping classes, corrupted image data, imbalanced classes, rotational variance and more.
This presentation demonstrates how we built our image processing pipeline to tackle these noise issues while addressing class/concept drift. First we'll examine the state of Formbay's data when we started and our initial model. Then we'll walk through each statistical and computational problem we met and how we decided to address it, slowly evolving our data pipeline over time.
This presentation focuses on the complexities of engineering production-ready ML systems, which involves balancing statistical performance ("how accurate") against computational performance ("how fast").
Outline/Structure of the Talk
The presentation will cover topics in this order:
- The Situation - Introduce the Compliance Checking Process for Solar Trading Credits
- Solar Inverter - Introduce the aim of Classification and the data situation at Formbay
- Cleaning Data - Starting with a clean minimal subset in order to start building a model around solid foundations
- CNN - Introduction to CNNs, what was used prior to CNNs, and how CNNs have blown away all previous methods
- ResNet - Introduction to the ResNet architecture and its software and hardware requirements
- Pipeline - Building the pipeline and the evolution of the pipeline
- Long Tail and Pareto Principle of Class Distribution - Resulting in the need to threshold, balance and augment classes
- Rotation Invariance - CNNs are not rotation invariant
- Overfitting - Overfitting problem due to the overlapping classes
- Hierarchical Class Architecture - Creation of a new class map schema and the resolution of visually indistinguishable classes
- Processing Performance - Solving performance issues via multi-GPU support and rearchitecting the computation pipeline graph
- Machine Learning in the Cloud - Cost optimisation
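As a rough illustration of the long-tail and rotation items in the outline above, here is a minimal Python sketch. It is not Formbay's actual pipeline: the function names, the `min_count` threshold, the catch-all `"other"` label and the example class labels are all hypothetical, and images are represented as plain nested lists to keep the example self-contained. It folds rare classes into a catch-all class, computes inverse-frequency class weights to counter imbalance, and generates the four right-angle rotations of an image as a simple form of augmentation.

```python
from collections import Counter


def threshold_classes(labels, min_count, other="other"):
    """Fold classes with fewer than min_count examples into a catch-all label."""
    counts = Counter(labels)
    return [lbl if counts[lbl] >= min_count else other for lbl in labels]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {lbl: n_samples / (n_classes * c) for lbl, c in counts.items()}


def rotate90(image):
    """Rotate a 2-D image (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees."""
    out = [image]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out


# Hypothetical labels with a long tail: one class dominates, one is rare.
labels = ["inverter"] * 6 + ["panel"] * 3 + ["meter"] * 1
folded = threshold_classes(labels, min_count=2)   # "meter" becomes "other"
weights = balanced_class_weights(folded)          # rarer classes weigh more

# Tiny 2x2 "image" used to show the rotation augmentation.
augmented = rotations([[1, 2], [3, 4]])
```

In a real training setup the weights would typically be passed to the loss function, and rotations would be applied to image tensors by the data loader rather than to nested lists.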
Attendees will gain an understanding of how to apply image classification to noisy real-world datasets under changing business conditions. Specifically, they will gain an understanding of the kinds of subtle problems that occur at scale and how complex a production machine learning system can be.
Target audience: machine learning engineers, data engineers and data analysts.
Prerequisites for Attendees
Basic knowledge about:
* Data pipelines.
* Software development.
* Image classification.
The talk is not specific to any programming language, but we do use Python.