Comma Police: Lessons From the Design and Implementation of a CSV Library
From CSV, to JSON, to YAML, DER, and the dreaded XML, many programmers deal with data formats all day. In statically-typed languages like Haskell, we can benefit greatly from imposing a rigid structure of types on the data we consume from these formats. We call this process of imposing structure on data "decoding".
Decoding libraries are available in many languages and for many formats. This talk explores the design decisions behind a new Haskell library, sv, which decodes CSV and similar formats such as PSV. sv addresses what we perceived as problems with other libraries. We will discuss the benefits and drawbacks of the interesting or noteworthy choices made in sv's philosophy and design, and make recommendations as to how these lessons could be applied to other formats.
Outline/Structure of the Talk
We begin with a brief problem description. We will then describe the structure of a library for dealing with a data format, with examples from QFPL's sv library. Such libraries have parse, decode, encode, and print phases. We will discuss the benefits and drawbacks of separating or combining these phases and justify sv's strict separation. This will lead naturally to a discussion of what the library's representation of the format might look like. sv keeps a syntax tree which preserves whitespace and other information that is usually lost. This costs memory, but we will see the benefit it brings: it allows us to write custom linting and sanitisation tools.
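The separation of phases can be sketched with a toy example. This is not sv's actual API: every name here (Field, parseRecord, printRecord, decodeRecord, hasPadding, stripPadding) is hypothetical, and the parser ignores quoting and handles only a single line. It is only meant to show how a whitespace-preserving syntax tree lets parsing, printing, decoding, and linting stay independent of one another.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Hypothetical syntax tree: each field remembers the whitespace around
-- its content, so printing can reproduce the input exactly.
data Field = Field
  { leading  :: String
  , content  :: String
  , trailing :: String
  } deriving (Eq, Show)

type Record = [Field]

-- Split a line on a separator character (no quoting support in this sketch).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Parse phase: text to syntax tree. Nothing is thrown away.
parseField :: String -> Field
parseField s =
  let (lead, rest)        = span isSpace s
      (trailRev, bodyRev) = span isSpace (reverse rest)
  in Field lead (reverse bodyRev) (reverse trailRev)

parseRecord :: Char -> String -> Record
parseRecord sep = map parseField . splitOn sep

-- Print phase: syntax tree back to text.
printRecord :: Char -> Record -> String
printRecord sep = intercalate [sep] . map printField
  where printField (Field l c t) = l ++ c ++ t

-- Decode phase: sees only the tree, never the raw text.
decodeRecord :: Record -> [String]
decodeRecord = map content

-- A lint check that is only possible because whitespace was preserved.
hasPadding :: Record -> Bool
hasPadding = any (\f -> not (null (leading f) && null (trailing f)))

-- A sanitisation pass: drop the padding, keep everything else.
stripPadding :: Record -> Record
stripPadding = map (\f -> f { leading = "", trailing = "" })
```

Because the separator is an argument, the same functions cover PSV by passing '|' instead of ','. In this sketch, printing a parsed record gives back the original line unchanged, which is the property that makes the extra memory worthwhile for tooling.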
sv does not use type classes for decoding or encoding. We will justify and motivate this choice. In particular, we can have multiple decoders of each type, including higher-order decoders like Decode a -> Decode (Maybe a). We also avoid the dreaded orphan instance problem.
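To make the idea of decoders as ordinary values concrete, here is a minimal sketch. It is an assumption-laden toy, not sv's real Decode type: the newtype below and the names int, intOrZero, and orEmpty are hypothetical, and only the shape Decode a -> Decode (Maybe a) is taken from the description above.

```haskell
import Data.Char (isDigit)

-- Hypothetical decoder type: a function from one field's text to
-- either an error message or a value.
newtype Decode a = Decode { runDecode :: String -> Either String a }

-- One decoder for Int (non-negative integers only, to keep the sketch short).
int :: Decode Int
int = Decode $ \s ->
  if not (null s) && all isDigit s
    then Right (read s)
    else Left ("not an integer: " ++ show s)

-- A second decoder for the same type: an empty field means zero.
-- With a type class there could be only one instance for Int.
intOrZero :: Decode Int
intOrZero = Decode $ \s ->
  if null s then Right 0 else runDecode int s

-- A higher-order decoder: an empty field becomes Nothing,
-- anything else is handed to the wrapped decoder.
orEmpty :: Decode a -> Decode (Maybe a)
orEmpty d = Decode $ \s ->
  if null s then Right Nothing else Just <$> runDecode d s
```

Here int and intOrZero coexist without newtype wrappers, and orEmpty int is built by plain function application. Since no instances are declared for the types being decoded, there is nothing that could become an orphan instance.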
Benchmarks of sv will be compared to competing CSV libraries such as cassava, and we will see that sv is slower than competing libraries. We will see what can be done about this! In particular, sv's modular design means we can integrate with fast chunks of other libraries, such as cassava's parser. This ties back to the separation of phases mentioned near the beginning, and leads naturally into the conclusion.
Attendees will learn what to take into consideration when choosing or designing a library for a data format in a strongly statically typed language. They will also learn about the benefits and drawbacks of sv's style.
This talk is for those who are interested in functional library design, particularly using strong static types.