Batch as a Special Case of Streaming
In this talk I will share my team's gruelling journey in attempting to migrate a batch-like system to a streaming framework.
I will walk through the solutions we tested using Flink, discuss each one's performance characteristics, and bring to light the misconceptions in their designs.
Outline/Structure of the Case Study
- Description of the problem
- Why we chose streaming
- Solution: Naive Flink
- Overview of watermarks
- How Flink handles window state
- Solution: Flink all in one window
- Limitations in Flink's windowing interface
- Solution: Flink with external datastore
- Solution: Flink with state on Kafka
- In-depth overview of solution
- Illustration of why this doesn't work
- Solution: Flink with stateful map
- Advantages over other solutions
- Conclusion
- Discussion on why the solutions did not meet objectives
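To give a feel for the contrast between the windowed solutions and the stateful map in the outline above, here is a minimal sketch in plain Python. It is not Flink code and not the implementation covered in the talk; the function names `windowed_sum` and `stateful_map_sum` and the sample events are hypothetical, chosen only to show how the amount of state differs between the two approaches.

```python
from collections import defaultdict


def windowed_sum(events):
    """Window-style: buffer every event per key, aggregate once at the end."""
    buffers = defaultdict(list)
    for key, value in events:
        buffers[key].append(value)  # state grows with the number of events
    return {key: sum(values) for key, values in buffers.items()}


def stateful_map_sum(events):
    """Stateful-map-style: keep one running total per key, emit on every event."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value  # state grows only with the number of keys
        yield key, totals[key]


# Hypothetical sample input: (key, value) pairs arriving in order.
events = [("a", 1), ("b", 2), ("a", 3)]

final_window_result = windowed_sum(events)
running_results = list(stateful_map_sum(events))
```

In this sketch the windowed version holds every event until the end, while the stateful version holds a single number per key and produces an updated result as each event arrives.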
Learning Outcome
Understand how to approach problems involving large aggregation windows in Flink.
Be able to identify batch solutions that will not work in a streaming system.
Understand the difference in performance characteristics between streaming and batch systems.
Target Audience
Anyone interested in migrating batch systems to use a streaming framework.
Prerequisites for Attendees
Reading the following blog article could be helpful:
https://data-artisans.com/blog/batch-is-a-special-case-of-streaming
Links
https://docs.google.com/presentation/d/1u5LUroPyFhoasdXY5BN57A-TJ64fCmSl8oJfp5Cm5X4/edit?usp=sharing