Entity Resolution at Scale
Real-world data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records into the single underlying entity that they represent.
This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark, and scaling it to process billions of records.
Outline/Structure of the Talk
The session will be split into three sections (plus an introduction and conclusion), each more concrete than the last:
- What is entity resolution? (Approximately 5 minutes) This section motivates the algorithm and establishes common terminology, as audience members may be familiar with the problem but not the name, or may be entirely unfamiliar with it. Subsections:
- Definition: connecting multiple records that represent the same underlying person/object/entity
- It’s a hard problem: missing data within a record, different representations of the same data. For instance:
- Names: “J Smith” and “John Smith”
- Phone numbers: “+61298764321” and “9876 4321”
- Addresses: “220 Pitt Street Sydney” and “Wesley Conference Centre 220 Pitt St Sydney NSW 2000 Australia”
- Algorithm overview: two core steps, resulting in a graph with “same as” edges between records, with a similarity score attached to each
- Step 1: Microblocking. Group records into (potentially multiple) “blocks” that have some sort of obvious similarity.
- Step 2: Scoring potential matches. Within each block, create a graph edge between all its records, and apply some scoring function.
- Lessons learnt scaling this on Apache Spark:
- DataFrames are faster than RDDs
- Good data partitioning is important
- (Row-based) Typed transformations are not as good as User-Defined Functions (UDFs)
- UDFs are not as good as the built-in operations, which are understood by the optimizer and have code-generation support. This holds despite some thoughts to the contrary: for instance, Spark itself had only a limited API for working with arrays in Datasets/DataFrames until the recent 2.4.0 release, because UDFs were “good enough”
- Using a profiler to find slow code focuses effort for fixes, and highlights hotspots such as typed transformations and UDFs
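To make step 1 of the outline concrete, here is a minimal, framework-free sketch of microblocking in plain Python. The specific blocking keys (surname plus first initial, and the last four phone digits) are illustrative assumptions for this sketch, not necessarily the scheme used in the talk; the point is that cheap keys group records with "some sort of obvious similarity", and a record can land in several blocks.

```python
import re
from collections import defaultdict

def blocking_keys(record):
    """Derive cheap 'obviously similar' keys for a record.

    The key choices (first initial + surname token, last 4 phone digits)
    are illustrative assumptions, not the talk's actual scheme.
    """
    keys = set()
    tokens = record.get("name", "").lower().split()
    if tokens:
        # e.g. "J Smith" and "John Smith" both yield "name:j|smith"
        keys.add(f"name:{tokens[0][0]}|{tokens[-1]}")
    digits = re.sub(r"\D", "", record.get("phone", ""))
    if len(digits) >= 4:
        # the last 4 digits survive most formatting differences,
        # e.g. "+61298764321" vs "9876 4321"
        keys.add(f"phone:{digits[-4:]}")
    return keys

def microblock(records):
    """Group record ids into blocks that share at least one key."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for key in blocking_keys(rec):
            blocks[key].add(rid)
    # only blocks with 2+ records can produce candidate pairs
    return {k: ids for k, ids in blocks.items() if len(ids) > 1}

records = {
    1: {"name": "J Smith", "phone": "+61298764321"},
    2: {"name": "John Smith", "phone": "9876 4321"},
    3: {"name": "Jane Doe", "phone": "5550 1234"},
}
print(microblock(records))
```

On real data each blocking key would be emitted as a (key, record) pair and grouped, which maps directly onto a `groupBy` over a Spark DataFrame.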
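Step 2 can be sketched the same way: within each block, create an edge between every pair of records and attach a similarity score, yielding the "same as" graph described above. Jaccard similarity over name tokens is used here purely as a stand-in scoring function; the talk's actual scoring is not specified in this outline.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets (an illustrative score only)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def score_blocks(blocks, records):
    """Within each block, create a 'same as' edge between every pair of
    its records, with a similarity score attached to each edge."""
    edges = {}
    for ids in blocks.values():
        for r1, r2 in combinations(sorted(ids), 2):
            if (r1, r2) in edges:
                continue  # the same pair may appear in several blocks
            t1 = records[r1]["name"].lower().split()
            t2 = records[r2]["name"].lower().split()
            edges[(r1, r2)] = jaccard(t1, t2)
    return edges

records = {
    1: {"name": "J Smith"},
    2: {"name": "John Smith"},
    3: {"name": "John Smith"},
}
blocks = {"name:j|smith": {1, 2, 3}}
print(score_blocks(blocks, records))
```

Because scoring only runs within blocks, the quadratic pairwise cost applies per block rather than across billions of records, which is what makes the approach scale.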
Learning Outcome
After this talk, the audience will:
- understand entity resolution with the pSig algorithm
- have some insight into performance and scaling considerations, especially on Spark
Target Audience
Data engineers and data scientists.
Prerequisites for Attendees
Attendees familiar with Apache Spark will get more out of the talk, but such familiarity is not a prerequisite.