
Real-world data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records into the single underlying entity that they represent.

This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark and scaling it to process billions of records.

Outline/Structure of the Talk

The session will be split into three sections (plus an introduction and a conclusion), each more concrete than the last:

  1. What is entity resolution? (Approximately 5 minutes) This section motivates the algorithm and establishes common terminology, as audience members may be familiar with the problem but not the name, or may be entirely unfamiliar with it. Subsections:
    • Definition: connecting multiple records that represent the same underlying person/object/entity
    • It’s a hard problem: missing data within a record, and different representations of the same data (a normalisation sketch in code follows this outline). For instance:
      • Names: “J Smith” and “John Smith”
      • Phone numbers: “+61298764321” and “9876 4321”
      • Addresses: “220 Pitt Street Sydney” and “Wesley Conference Centre 220 Pitt St Sydney NSW 2000 Australia”
  2. What is the pSig algorithm? (Approximately 7 minutes) This section describes the algorithm itself, using a concrete dataset (such as the (name, phone, address) records above) to show how its two stages work. The algorithm can be demonstrated clearly with diagrams. Subsections:
    • Algorithm overview: two core steps, resulting in a graph with “same as” edges between records, with a similarity score attached to each
    • Step 1: Microblocking. Group records into (potentially multiple) “blocks” that share some cheap, obvious similarity (sketched in Spark code after this outline).
    • Step 2: Scoring potential matches. Within each block, create a graph edge between every pair of its records and apply a scoring function to each edge (also sketched after this outline).
  3. How to best use Spark to run pSig? (Approximately 13 minutes) This section touches on some specifics of an implementation of pSig on Spark, and will work through a series of optimisations that we performed and the resulting improvements. Subsections:
    • DataFrames are faster than RDDs
    • Good data partitioning is important
    • (Row-based) Typed transformations are not as good as User-Defined Functions (UDFs)
    • UDFs are not as good as the built-in operations, which are understood by the optimizer and have code-generation support. This holds despite some opinions to the contrary; for instance, Spark itself had only a limited API for working with arrays in Datasets/DataFrames until the recent 2.4.0 release, arguably because UDFs were considered “good enough” (see the three-way comparison sketched after this outline)
    • Using a profiler to find slow code focuses optimisation effort, and highlights hotspots such as typed transformations and UDFs
  4. Conclusion. The audience should leave understanding entity resolution as a problem, how pSig solves it, and some key performance lessons from optimising an implementation.
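
To make the representation problems from section 1 concrete, here is a minimal normalisation sketch in plain Scala, covering only the name and phone examples above. The helper names and rules (Australian numbers, keeping the last eight digits, collapsing whitespace) are illustrative assumptions, not the talk’s actual implementation:

```scala
// A normalisation sketch; the rules below are illustrative assumptions.
object Normalise {
  // "+61298764321" and "9876 4321" both normalise to "98764321":
  // strip non-digits, drop the (assumed Australian) country code and
  // trunk prefix, then keep the eight-digit local number.
  def phone(raw: String): String =
    raw.filter(_.isDigit).stripPrefix("61").stripPrefix("0").takeRight(8)

  // Lower-case and collapse whitespace. Note that "j smith" and
  // "john smith" still differ after this, which is why pSig scores
  // candidate pairs instead of demanding exact matches.
  def name(raw: String): String =
    raw.toLowerCase.trim.replaceAll("\\s+", " ")
}
```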
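
The microblocking step from section 2 can be sketched on Spark as generating one or more signatures per record and grouping on them. The (id, name, phone) schema and the particular signature functions below are assumptions chosen to mirror the examples above, not what pSig prescribes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("psig-sketch").getOrCreate()
import spark.implicits._

// Toy records; real inputs would be billions of rows.
val records = Seq(
  (1L, "John Smith", "+61298764321"),
  (2L, "J Smith",    "9876 4321")
).toDF("id", "name", "phone")

// Step 1: each record emits several signatures (last eight phone
// digits; first initial + surname). Records sharing a signature
// fall into the same block, so one record can land in several blocks.
val signatures = records
  .withColumn("phoneSig",
    substring(regexp_replace($"phone", "[^0-9]", ""), -8, 8))
  .withColumn("nameSig",
    concat(substring(lower($"name"), 1, 1),
           substring_index(lower($"name"), " ", -1)))
  .select($"id", explode(array($"phoneSig", $"nameSig")).as("signature"))
```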
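
The scoring step then turns co-blocked records into scored “same as” edges. Continuing the sketch above, a self-join on the signature column yields candidate pairs, and a normalised Levenshtein distance on names stands in for whatever scoring function a real deployment would use:

```scala
// Candidate pairs: records sharing at least one block. The id ordering
// avoids self-pairs and mirrored duplicates; distinct() collapses pairs
// that co-occur in several blocks.
val candidates = signatures.as("a")
  .join(signatures.as("b"),
        $"a.signature" === $"b.signature" && $"a.id" < $"b.id")
  .select($"a.id".as("src"), $"b.id".as("dst"))
  .distinct()

// Attach attributes and score each edge; levenshtein() is a Spark
// built-in, so the scoring stays inside the optimizer/codegen path.
val edges = candidates
  .join(records.select($"id".as("src"), $"name".as("srcName")), "src")
  .join(records.select($"id".as("dst"), $"name".as("dstName")), "dst")
  .withColumn("score",
    lit(1.0) - levenshtein($"srcName", $"dstName") /
               greatest(length($"srcName"), length($"dstName")))
```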
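
The performance ranking in section 3 can be illustrated with three ways of stripping non-digits from the phone column, continuing the sketch above. The comments summarise the bullet points; the digit-stripping task itself is just a stand-in example:

```scala
// (a) Typed transformation: Catalyst sees an opaque lambda, and every
// value is deserialised into a JVM object before the function runs.
val viaTyped = records.select($"phone").as[String].map(_.filter(_.isDigit))

// (b) UDF: avoids whole-row encoding, but is still a black box to the
// optimizer, with no code generation for the function body.
val stripNonDigits = udf((s: String) => s.filter(_.isDigit))
val viaUdf = records.withColumn("digits", stripNonDigits($"phone"))

// (c) Built-in: regexp_replace is understood by the optimizer and
// benefits from whole-stage code generation.
val viaBuiltin =
  records.withColumn("digits", regexp_replace($"phone", "[^0-9]", ""))

// Partitioning matters too: repartitioning on the join key before the
// large self-join above keeps the shuffle predictable.
val partitioned = signatures.repartition($"signature")
```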

Learning Outcome

After this talk, the audience will:

  • understand entity resolution and how the pSig algorithm solves it
  • have some insight into performance and scaling considerations, especially on Spark

Target Audience

Data engineers and data scientists.

Prerequisites for Attendees

Attendees familiar with Apache Spark will get more out of the talk, but Spark experience is not a prerequisite.
