May 7th 09:50 - 10:20 AM · Red Room · 110 Interested

In this talk we will look at how to summarize large, potentially unbounded streams of data efficiently in both space and time by approximating the underlying distribution with so-called sketch algorithms. The main approach we will look at is summarization via histograms. Histograms have a number of desirable properties: they work well in an online setting, are embarrassingly parallel, and use bounded space. They also capture the entire (empirical) distribution, which often gets lost when doing descriptive statistics. Building on that, we will delve into the related problems of sampling in a stream setting and updating in a batch setting, and highlight some cool tricks such as capturing time dynamics via data snapshotting. To finish off, we will touch upon algorithms for summarizing categorical data, most notably the count-min sketch.
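To give a flavour of the categorical case, here is a minimal count-min sketch in Python. It is an illustration under stated assumptions rather than the implementation presented in the talk: the class name `CountMinSketch`, the default `width` and `depth`, and the use of salted BLAKE2b hashes are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed space (hypothetical example class)."""

    def __init__(self, width=1000, depth=5):
        # width and depth are illustrative defaults, not tuned values
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row; each row uses a differently salted hash
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never undercounts
        return min(self.table[row][col] for row, col in self._buckets(item))


cms = CountMinSketch()
for _ in range(100):
    cms.add("apple")
for _ in range(3):
    cms.add("pear")
print(cms.estimate("apple"))
```

The estimate never falls below the true count and never exceeds the total number of items added; a wider table reduces the chance of overestimation at the cost of more memory.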


Learning Outcome

* What sketch algorithms are (big idea: summarize your data with some data structure and query that)

* When and where they are useful

* How to think about the tradeoffs they bring

* How a histogram sketch works, in depth, and how it is implemented
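
As a rough idea of what a histogram sketch can look like, the following Python example keeps at most `max_bins` weighted centroids and merges the two closest ones whenever the limit is exceeded. This is one common centroid-merging design; whether the talk uses this exact variant is an assumption, and the names `HistogramSketch`, `max_bins`, `add`, and `merge` are hypothetical.

```python
import bisect


class HistogramSketch:
    """Streaming histogram with a fixed number of bins (hypothetical example class)."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, value, count=1):
        # Insert the value as its own bin (or bump an exact match), then compress
        keys = [b[0] for b in self.bins]
        i = bisect.bisect_left(keys, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [value, count])
        self._compress()

    def _compress(self):
        # Merge the two adjacent bins with the smallest gap until within budget
        while len(self.bins) > self.max_bins:
            i = min(
                range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0],
            )
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def merge(self, other):
        # Sketches built on separate partitions can be combined afterwards,
        # which is what makes the approach easy to parallelize
        merged = HistogramSketch(self.max_bins)
        for centroid, count in self.bins + other.bins:
            merged.add(centroid, count)
        return merged

    def total(self):
        return sum(n for _, n in self.bins)

    def mean(self):
        return sum(c * n for c, n in self.bins) / self.total()


left, right = HistogramSketch(16), HistogramSketch(16)
for x in range(500):
    left.add(x)
for x in range(500, 1000):
    right.add(x)
combined = left.merge(right)
print(len(combined.bins), combined.total(), combined.mean())
```

Memory stays bounded by `max_bins` no matter how long the stream is. The total count is preserved exactly and the mean up to floating-point rounding, since merged centroids are count-weighted averages, while quantiles and densities read from the bins are approximations.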

Target Audience

Developers looking to implement descriptive statistics on streams of data; practicing data scientists and analysts doing exploratory analysis

Prerequisites for Attendees

* At least a passing familiarity with descriptive statistics
* Familiarity with data structures and algorithms is helpful but not required
* Experience with processing large amounts of data, ideally in a streaming setting, will make the use cases and applicability more apparent, but again is not a requirement

Submitted 6 months ago