Episodes

  • Real Data, Honest Numbers — Entity Resolution in the Trenches
    2026/04/27

    What actually happens when you point an open-source dedup pipeline at a 208,000-row Wake County voter file? What does Splink do that GoldenMatch doesn't, and vice versa? Why did the Bulldozer Blue Book dataset start at F1 of 0.05? In this episode, we walk through a year of running entity resolution on real public datasets — voter rolls, NPPES providers, blockchain wallets, OFAC sanctions, equipment auctions — and the open-source tools we reach for, with the numbers (good and bad) attached.

    Topics covered

    - The 34-second pipeline run on 208,505 NC voter records — what it actually did

    - Why entity resolution matters: healthcare, sanctions, blockchain attribution

    - The four-tool Python ER landscape: Splink, dedupe, RecordLinkage, GoldenMatch

    - The configuration gauntlet — silent decisions that shape your matches before precision/recall ever enters the chat

    - Walking the Golden Suite end-to-end: GoldenCheck → GoldenFlow → GoldenMatch

    - The Bulldozer Blue Book dataset (401,125 equipment auction records) and the F1 = 0.05 honest moment — and how it climbed to 0.36

    - LLM calibration on the UK Schools register: 6 errors found that the standard profiler missed, for $0.01

    - BPID (Amazon, EMNLP 2024) — the adversarial PII benchmark and what the DOB parsing fix taught us about embeddings vs. boring date parsing

    - DBLP-ACM vs. Febrl — where Splink genuinely wins, and what the setup cost is

    - 13M-row blockchain wallet attribution: OFAC + Etherscan + Sourcify + Forta + DeFiLlama, and the three findings only visible at scale

    - Open-source vulnerability database reconciliation (15 sources)

    - Honest ecommerce dedup: starting at F1 = 0.05, the steps that got it to 0.36

    - The hospital records walkthrough: zero-config vs. explicit vs. LLM-assisted

    - The Turkish retail dataset and pre-dedup data quality findings

    - 10 data problems that show up on every real dataset

    - The TypeScript port (GoldenCheck-TS, InferMap-TS) and the use cases it unlocks

    - MCP integration: handing AI agents a data-cleaning toolkit
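Several of the topics above — blocking rules, similarity thresholds, and the precision/recall/F1 numbers — fit in a few lines of code. The sketch below is a hypothetical illustration (made-up records, an arbitrary 0.8 threshold, stdlib string similarity), not the episode's pipeline or the Golden Suite's actual code. It shows how a blocking choice silently caps recall before any scoring happens:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, first_name, last_name, city).
# Not drawn from any of the episode's datasets.
records = [
    (1, "Jon",  "Smith", "Raleigh"),
    (2, "John", "Smyth", "Raleigh"),  # true match for record 1, but last name differs
    (3, "Jane", "Doe",   "Durham"),
    (4, "Jane", "Doe",   "Durham"),
    (5, "Jake", "Smith", "Cary"),
]
true_matches = {(1, 2), (3, 4)}

# Blocking: only pairs sharing a last name are ever compared.
# This is the kind of silent configuration decision that shapes matches
# before evaluation starts — pair (1, 2) is lost right here.
blocks = {}
for rec in records:
    blocks.setdefault(rec[2], []).append(rec)

def similarity(a, b):
    """Mean string similarity over first name and city."""
    pairs = [(a[1], b[1]), (a[3], b[3])]
    return sum(SequenceMatcher(None, x, y).ratio() for x, y in pairs) / len(pairs)

# Compare within blocks; accept pairs above the (arbitrary) threshold.
predicted = set()
for group in blocks.values():
    for a, b in combinations(group, 2):
        if similarity(a, b) >= 0.8:
            predicted.add((a[0], b[0]))

# Score against ground truth.
tp = len(predicted & true_matches)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(true_matches)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(predicted, precision, recall, round(f1, 2))  # perfect precision, halved recall
```

Every accepted pair here is correct (precision 1.0), yet half the true matches are unreachable because blocking never generated them — the same failure mode that keeps a real pipeline's F1 low no matter how the classifier is tuned.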

    Tools mentioned

    - Splink — Fellegi-Sunter probabilistic ER on DuckDB / Spark / Athena

    - dedupe (Python) — interactive labeling, blocking predicates

    - RecordLinkage — classical, lighter weight

    - GoldenMatch / GoldenCheck / GoldenFlow / GoldenPipe / InferMap — the Golden Suite

    - BPID — Amazon's open-source PII matching benchmark (EMNLP 2024)

    - MCP — Model Context Protocol for AI agent tool access
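The Fellegi-Sunter model behind Splink scores a candidate pair by summing per-field log2 Bayes factors. A minimal sketch of that arithmetic, using made-up m/u probabilities and a hypothetical prior (Splink estimates these parameters from the data, e.g. via expectation-maximisation — nothing below is Splink's API):

```python
import math

# Made-up m/u probabilities per field: m = P(fields agree | same entity),
# u = P(fields agree | different entities). Illustrative values only.
params = {
    "first_name": {"m": 0.90, "u": 0.01},
    "dob":        {"m": 0.95, "u": 0.001},
    "city":       {"m": 0.80, "u": 0.10},
}

def match_weight(agreements):
    """Fellegi-Sunter match weight: sum of per-field log2 Bayes factors."""
    w = 0.0
    for field, agree in agreements.items():
        m, u = params[field]["m"], params[field]["u"]
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

def match_probability(weight, prior_odds=1 / 1000):
    """Convert a weight to a probability, given prior odds that a random pair matches."""
    odds = prior_odds * 2 ** weight
    return odds / (1 + odds)

# Names and DOB agree, city disagrees: still strong evidence of a match.
w = match_weight({"first_name": True, "dob": True, "city": False})
p = match_probability(w)
print(round(w, 2), round(p, 2))
```

Note how a disagreeing field subtracts evidence rather than vetoing the pair — the model weighs each field by how discriminating it is, which is why a rare-agreement field like DOB contributes far more weight than city.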

    Datasets referenced

    - NC State Board of Elections — Wake County active voters (208,505 rows)

    - NPPES — US healthcare provider registry (6M records)

    - Kaggle Bulldozer Blue Book — 401,125 equipment auction records

    - UK Get Information About Schools (GIAS) — 52,288 records

    - DBLP-ACM bibliographic benchmark (4,910 records)

    - Febrl synthetic ER benchmark

    - BPID — 10,000 adversarial PII pairs

    - Public blockchain attribution: OFAC SDN, Etherscan, Sourcify, Forta, DeFiLlama

    - Wagner / OFAC sanctions list (2018 historical)

    - Turkish Superstore retail dataset (5.1M rows)

    Links

    - All blog posts referenced in this episode: [bensevern.dev/blog](https://bensevern.dev/blog)

    - Golden Suite source: [bensevern.dev](https://bensevern.dev)

    - Cover art and pipeline source: open at [github.com/bsevern](https://github.com/bsevern)

    45 min