Is the 370 the worst bus in Sydney?
In Switzerland, people are surprised when a bus is 2 minutes late. In Sydney, people only consider it noteworthy if a bus is more than 20 minutes late, and lateness varies greatly between routes and providers. So, how do Sydney bus routes stack up? And if we're talking about privatisation, how do the private bus providers stack up against the state-run buses?
To answer these questions we need data… lots of data. Hooray for open government data! Transport for NSW publishes real-time information on the location and lateness of all public transport. Unfortunately it's ephemeral – there is no public log of historical lateness for us to analyse. To gather the data I needed, I had to fetch, log and aggregate ephemeral real-time data that was never intended to be used this way. There are random gaps, and spontaneous route or timetable changes for special events, roadworks or holidays. Even with noisy data, patterns start to emerge across months and we can start to answer some questions. The 370 bus route is one of the most complained-about routes in Sydney; it even has its own Facebook group of ironic fans... but is it really the worst bus? Let's look at the data.
Outline/Structure of the Case Study
Part 1: Introduction and rationale
- Amusing examples and anecdotes about the 370 bus
- An introduction to the real-time delay data provided by Transport for NSW
- An overview of the origin of the GTFS Static and Realtime data formats
Part 2: Implementation
- How I gathered the data
- How much data it was (~500GB over 4 months)
- How I processed the data
- Issues in dealing with inconsistencies in the data
- Issues scaling and parallelising the processing
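As an illustration of the kind of processing Part 2 covers: a feed that is polled repeatedly reports the same trip at the same stop many times, so one plausible early step is to keep only the most recent observation for each. This is a minimal sketch, not the pipeline used in the talk. The `DelayObservation` record and its field names are hypothetical, and it assumes each logged snapshot has already been parsed into such records.

```python
from typing import Iterable, List, NamedTuple


class DelayObservation(NamedTuple):
    """Hypothetical record parsed from one logged snapshot of the feed."""
    service_date: str   # e.g. "2024-01-01"; trip IDs can repeat across days
    fetched_at: int     # Unix time at which the snapshot was fetched
    route_id: str
    trip_id: str
    stop_id: str
    delay_seconds: int  # positive = late, negative = early


def latest_per_stop(observations: Iterable[DelayObservation]) -> List[DelayObservation]:
    """Keep only the most recently fetched observation per (date, trip, stop)."""
    latest = {}
    for obs in observations:
        key = (obs.service_date, obs.trip_id, obs.stop_id)
        if key not in latest or obs.fetched_at > latest[key].fetched_at:
            latest[key] = obs
    return list(latest.values())
```

Keying on the service date as well as the trip and stop is a design choice: without it, the same scheduled trip on different days would collapse into a single observation.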
Part 3: Results
- What it means for a bus to be on time, according to NSW and Victoria
- Which buses are the most and least on time
- Which buses are most frequently more than 20 minutes late
- Which bus agencies are most and least on time
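The questions in Part 3 come down to per-route counts over the logged delays. The sketch below shows one way such a summary could be computed, assuming the data has been reduced to `(route_id, delay_seconds)` pairs. The function name and the `early_limit` and `late_limit` defaults are placeholders chosen for illustration, not the official NSW or Victorian definitions of "on time"; only the 20-minute threshold comes from the outline above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

VERY_LATE_SECONDS = 20 * 60  # "more than 20 minutes late", as in the outline


def summarise_routes(
    delays: Iterable[Tuple[str, int]],
    early_limit: int = -60,   # placeholder: at most 1 minute early
    late_limit: int = 300,    # placeholder: at most 5 minutes late
) -> Dict[str, Dict[str, float]]:
    """Return the on-time and very-late share of observations for each route."""
    counts = defaultdict(lambda: {"total": 0, "on_time": 0, "very_late": 0})
    for route_id, delay in delays:
        c = counts[route_id]
        c["total"] += 1
        if early_limit <= delay <= late_limit:
            c["on_time"] += 1
        if delay > VERY_LATE_SECONDS:
            c["very_late"] += 1
    return {
        route: {
            "observations": c["total"],
            "on_time_share": c["on_time"] / c["total"],
            "very_late_share": c["very_late"] / c["total"],
        }
        for route, c in counts.items()
    }
```

Sorting the result by `on_time_share` or `very_late_share` would give the "most and least on time" rankings, and grouping by an agency identifier instead of `route_id` would give the same comparison between providers.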
Intended audience: people interested in a fun use of open government data.