Experimenting with Distributed Data Processing in Haskell
Apache Spark is one of the most popular data processing frameworks in the world and is widely used in the enterprise. Its popularity is due in no small part to its adoption of the functional paradigm: it demonstrates how purity, higher-order functions and laziness simplify the processing of large datasets. Haskell excels at all of those things, so it is only natural to think that Haskell would be a good fit for distributed data processing. Tweag.io's Sparkle and Soostone's Hadron are two examples from the Haskell ecosystem.
'distributed-dataset' is a framework written in Haskell designed to efficiently process large amounts of data. With GHC's StaticPointers extension we are able to distribute a computation across different machines, and using the technique described by Zaharia et al. that led to Apache Spark, we can express and execute large-scale data transformations using a pretty DSL.
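To give a flavour of the mechanism, here is a minimal sketch of StaticPointers on its own, using GHC's `GHC.StaticPtr` module directly (distributed-dataset builds richer serialisable `Closure`s on top of this primitive; that layer is omitted here):

```haskell
{-# LANGUAGE StaticPointers #-}
import GHC.StaticPtr (StaticPtr, deRefStaticPtr)

-- 'static e' requires e to be a closed expression (no free local
-- variables) and yields a StaticPtr: a stable fingerprint naming a
-- top-level value. The fingerprint can be serialised and sent to
-- another process running the same binary, which resolves it back
-- to the value locally -- so we ship a *name* for code, not code.
double :: StaticPtr (Int -> Int)
double = static (* 2)

main :: IO ()
main = print (deRefStaticPtr double 21)  -- prints 42
```

Because only the fingerprint crosses the network, this sidesteps the need to serialise arbitrary function closures.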
In this talk, I am going to give a brief introduction to the library, and then move on to explaining the key implementation ideas and the advantages that Haskell offers to distributed data processing.
Reference: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia et al.
Outline/Structure of the Talk
I will start by introducing the concept of a 'Dataset': a partitioned multiset together with a set of transformations. In order to implement this in Haskell, we are going to need a few mechanisms (the StaticPointers extension and a "Dataset" type); the next part of the talk will be about those.
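The partitioned-multiset idea can be sketched with a toy model. These are not distributed-dataset's actual definitions (the real library takes serialisable `Closure`s built with `static` instead of plain functions, so the transformations can be shipped to remote executors), but they show the shape of the abstraction:

```haskell
-- Toy model: a Dataset is a collection of partitions. "Narrow"
-- transformations such as map and filter apply to each partition
-- independently, so different partitions can be processed on
-- different machines with no data movement between them.
newtype Dataset a = Dataset { partitions :: [[a]] }

dMap :: (a -> b) -> Dataset a -> Dataset b
dMap f (Dataset ps) = Dataset (map (map f) ps)

dFilter :: (a -> Bool) -> Dataset a -> Dataset a
dFilter p (Dataset ps) = Dataset (map (filter p) ps)

-- Gather every partition back on the driver.
dCollect :: Dataset a -> [a]
dCollect = concat . partitions
```

Transformations that regroup data across partitions (a shuffle) are the expensive case, and distinguishing the two is what makes the execution plan efficient.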
Once we have the core machinery in place, I will show that we can build utilities to concisely express common transformations, including typed and composable aggregations.
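As a sketch of what "typed and composable aggregations" can look like, here is an aggregation type in the style of the 'foldl' library's `Fold` (again an assumption for illustration, not distributed-dataset's actual `Aggr` type; the real thing also carries a way to merge per-partition states, which this sketch omits):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- An aggregation is a left-fold step, an initial state, and a
-- finaliser. The state is small, so folding each partition locally
-- and sending only the accumulators over the network is cheap.
data Aggr a b = forall s. Aggr (s -> a -> s) s (s -> b)

instance Functor (Aggr a) where
  fmap f (Aggr step z out) = Aggr step z (f . out)

-- Composing two aggregations still traverses the data only once:
-- the states are paired and stepped together.
instance Applicative (Aggr a) where
  pure x = Aggr const () (const x)
  Aggr stepF zF outF <*> Aggr stepX zX outX =
    Aggr (\(sf, sx) a -> (stepF sf a, stepX sx a))
         (zF, zX)
         (\(sf, sx) -> outF sf (outX sx))

aggrSum :: Num a => Aggr a a
aggrSum = Aggr (+) 0 id

aggrCount :: Aggr a Int
aggrCount = Aggr (\n _ -> n + 1) 0 id

-- Built compositionally: mean = sum / count, in a single pass.
aggrMean :: Fractional a => Aggr a a
aggrMean = (/) <$> aggrSum <*> (fromIntegral <$> aggrCount)

runAggr :: Aggr a b -> [a] -> b
runAggr (Aggr step z out) = out . foldl step z
```

The Applicative instance is what makes aggregations composable without extra passes over the dataset.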
At the end of the talk, I will give some examples of close-to-real-world uses along with some performance figures, and finally mention related work in the Haskell ecosystem.
Learning Outcome
After this talk, people will hopefully consider using Haskell and this library when they need to process datasets that exceed a single computer's memory.
As an added bonus, the talk will cover a lesser-known but very useful GHC extension called StaticPointers, which opens up many possibilities in distributed computing and should interest quite a few attendees.
Target Audience
People who are interested in data processing and/or functional programming.
Prerequisites for Attendees
Knowledge of basic Haskell syntax would be useful, but not required.