Deterministic vs probabilistic record linkage
This post introduces record linkage, compares deterministic and probabilistic approaches, and explains when each method makes sense.
Why record linkage matters
In an ideal world, every person, case, and organisation would share a universal identifier. Linking across data sources would be trivial, and you could answer questions that are otherwise out of reach.
Reality is messier. Most organisations have fragmented data spread across multiple systems, with no reliable, universal identifier. Source systems are designed primarily for operations, not analytics. Identifiers that would make linkage easy can be missing, inconsistently recorded, or unavailable for legal and practical reasons. Data protection constraints can further limit what is captured or retained.
This pattern is common across government, healthcare, finance, and research. Different systems capture overlapping populations but lack a shared key. Personal data can be incomplete, inconsistently recorded, and prone to change over time.
The absence of a reliable identifier constrains the services organisations can provide and the questions they can answer. Without linkage, you cannot track individuals across touchpoints, measure end-to-end journeys, or build a complete picture from fragmented sources.
At scale, record linkage is both a statistical and an engineering challenge. Some datasets run into the tens of millions of rows, and some linkage exercises involve hundreds of millions of records. A naive cartesian join (every record compared with every other) is infeasible. We need approaches that are accurate, transparent about uncertainty, and engineered to scale, including careful blocking strategies that reduce the search space to plausible candidate pairs.
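As a rough illustration of why blocking matters, the sketch below compares the number of candidate pairs generated with and without a blocking key. The data frames and the choice of date of birth as the blocking key are invented for illustration; real pipelines typically union several blocking rules so that a single miskeyed field cannot eliminate a true match.

```python
import pandas as pd

# Hypothetical extracts; in practice these would be millions of rows each.
df_a = pd.DataFrame({"rec_id": [1, 2, 3], "dob": ["1991-02-14", "1984-07-01", "1990-11-30"]})
df_b = pd.DataFrame({"rec_id": [4, 5, 6], "dob": ["1991-02-14", "1975-03-22", "1990-11-30"]})

# Naive approach: compare every record in A with every record in B.
cartesian = df_a.merge(df_b, how="cross", suffixes=("_a", "_b"))

# Blocking: only generate candidate pairs that agree on the blocking key,
# here date of birth. The saving grows quadratically with dataset size.
blocked = df_a.merge(df_b, on="dob", suffixes=("_a", "_b"))

print(f"{len(cartesian)} pairs without blocking, {len(blocked)} candidate pairs with blocking")
```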
That leads to a simple question: what tools and methods let us link records reliably at scale, while being transparent about uncertainty and robust to messy real-world data?
Deterministic linkage: strengths and limits
Many linkage efforts start with deterministic linkage. In SQL terms, this is a join: if a set of fields match exactly, we treat the records as referring to the same entity. Practitioners typically define combinations of identifiers they believe indicate a match, build them into large join conditions, and exclude records matched by earlier rules from subsequent passes.
For example, you might decide that if two entries match on first name, last name, date of birth, and address, they should be labelled as the same individual. Deterministic linkage can work well when data quality is high and the join keys are reliable. It produces high-precision output, slots neatly into existing databases and pipelines, and is computationally efficient (typically a collection of hash joins in modern database engines).
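In code, a deterministic rule is just an exact join. The records below are invented for illustration; note how a single spelling variant is enough to drop a pair entirely, which is the brittleness discussed next.

```python
import pandas as pd

# Hypothetical extracts from two systems (illustrative values only).
df_a = pd.DataFrame([
    {"first_name": "JANE", "surname": "SMITH", "dob": "1991-02-14", "address": "12 PARK ROAD"},
])
df_b = pd.DataFrame([
    {"first_name": "JANE",  "surname": "SMITH", "dob": "1991-02-14", "address": "12 PARK ROAD"},
    {"first_name": "JAYNE", "surname": "SMITH", "dob": "1991-02-14", "address": "12 PARK ROAD"},
])

# The deterministic rule: exact agreement on all four fields.
matches = df_a.merge(df_b, on=["first_name", "surname", "dob", "address"], how="inner")
print(matches)  # only the exact-spelling row matches; 'JAYNE' is silently dropped
```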
But the brittleness shows as soon as data quality slips. Completeness issues, typographical errors, formatting inconsistencies, and changes over time can turn true matches into missed matches. In practice, deterministic methods often underperform on recall, and the shortfall is not evenly distributed across groups and contexts.
A concrete example is names. They may be recorded differently across systems, including variations in spelling, spacing, punctuation, ordering, and transliteration. Non-Anglicised names can be particularly vulnerable to inconsistent recording, where phonetic approximations or unfamiliarity lead to persistent discrepancies. If you rely on exact matching, these issues can propagate into biased linkage outcomes, with some individuals systematically less likely to be linked across systems.[1]
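A toy example of the failure mode, using only the Python standard library: exact equality treats common transliteration variants as entirely different people, while even a crude string-similarity measure recognises them as close. Production systems typically use measures such as Jaro-Winkler or phonetic encodings rather than difflib.

```python
from difflib import SequenceMatcher

# Illustrative transliteration variants of the same name.
variants = ["MOHAMMED", "MUHAMMAD", "MOHAMED"]

for i, a in enumerate(variants):
    for b in variants[i + 1:]:
        similarity = SequenceMatcher(None, a, b).ratio()
        print(f"{a} vs {b}: exact match = {a == b}, similarity = {similarity:.2f}")
```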
Probabilistic linkage: a different approach
Probabilistic linkage treats the problem differently. Instead of requiring exact agreement on a set of fields, it treats linkage as an evidence-based decision. You define a set of comparisons (for example, name similarity or date-of-birth agreement), and a model combines those signals to estimate how likely it is that two records refer to the same entity.
To see the difference, consider two records that clearly refer to the same person but contain minor discrepancies:
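For example, a made-up pair with this flavour of discrepancies (the names, dates, and addresses here are invented purely for illustration):

```python
# A made-up pair for illustration: the same person, recorded slightly
# differently in two systems (spelling variant, day/month swap, abbreviation).
record_a = {"first_name": "JANE",  "surname": "SMITH", "dob": "1991-02-12", "address": "12 PARK ROAD"}
record_b = {"first_name": "JAYNE", "surname": "SMITH", "dob": "1991-12-02", "address": "12 PARK RD"}
```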
In this instance, even if you used a collection of deterministic rules (for example, searching for a match on name and date of birth, or name and address), you would still miss the match due to the discrepancies.
Probabilistic linkage, by contrast, weighs the evidence from each field. A model trained on your data learns how much evidence each type of agreement (or disagreement) provides. A waterfall chart makes this visible for the same record pair: each comparison contributes to an overall match score.
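The arithmetic behind such a chart can be sketched directly. Everything below is invented for illustration: the comparison levels, the m and u probabilities, and the prior are not trained values, but the structure (a log2 Bayes factor per comparison, summed and combined with a prior) is the Fellegi-Sunter calculation described below.

```python
import math

# Invented comparison levels and parameters, purely for illustration.
# m: probability of observing this level when the records are a true match.
# u: probability of observing this level when they are not.
params = {
    "first_name": {"exact": (0.90, 0.010), "close": (0.08, 0.020), "different": (0.02, 0.970)},
    "surname":    {"exact": (0.95, 0.005), "different": (0.05, 0.995)},
    "dob":        {"exact": (0.90, 0.001), "close": (0.08, 0.010), "different": (0.02, 0.989)},
    "address":    {"exact": (0.70, 0.001), "close": (0.20, 0.020), "different": (0.10, 0.979)},
}

# Observed comparison levels for the hypothetical record pair above.
observed = {"first_name": "close", "surname": "exact", "dob": "close", "address": "close"}

total_weight = 0.0
for field, level in observed.items():
    m, u = params[field][level]
    weight = math.log2(m / u)  # log2 Bayes factor contributed by this comparison
    total_weight += weight
    print(f"{field:<10} {level:<9} weight {weight:+.2f}")

# Combine with a prior: assume 1 in 1,000 candidate pairs is a true match.
prior_odds = 0.001 / (1 - 0.001)
posterior_odds = prior_odds * 2 ** total_weight
print(f"match probability: {posterior_odds / (1 + posterior_odds):.3f}")
```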
This matters because it is more robust to imperfect data and gives you a structured way to express uncertainty. Rather than forcing every pair into “match” or “no match”, probabilistic linkage supports practical decision-making. You can separate very high-confidence links from borderline cases that may need review, and be explicit about where uncertainty remains.
The statistical foundations come from the Fellegi-Sunter framework, developed in the 1960s.[2][3] The core idea is that each comparison (name, date of birth, address, etc.) provides evidence for or against a match. Fields that agree on rare values provide stronger evidence than fields that agree on common values. The model combines these signals into an overall match weight or probability.
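For instance, with a made-up m probability and made-up value frequencies, agreement on a rare surname contributes far more weight than agreement on a very common one, which is why linkage models often apply term-frequency adjustments:

```python
import math

m = 0.95  # illustrative: P(surnames agree | records are a true match)

# u is roughly the chance two unrelated records share the value, so it is
# driven by how common the value is in the population (frequencies invented).
for surname, u in [("SMITH", 0.012), ("ZOLKIEWICZ", 0.00002)]:
    print(f"agreement on {surname}: match weight {math.log2(m / u):+.2f}")
```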
Choosing an approach
So when should you use each method?
Deterministic linkage works well when:
- Data quality is high and identifiers are reliable
- You have unique, stable keys (e.g. national ID numbers)
- Precision matters more than recall
- Computational simplicity is a priority
Probabilistic linkage is better suited when:
- Data quality is variable or uncertain
- You need to link across systems with inconsistent recording practices
- Recall matters: you want to capture as many true matches as possible and can tolerate some false positives
- You need to quantify and communicate uncertainty
- The population includes groups vulnerable to inconsistent data capture
In practice, many linkage pipelines use both. A deterministic pass can quickly resolve high-confidence matches on strong identifiers, while a probabilistic model handles the remainder. This hybrid approach balances efficiency with robustness.
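A sketch of the hybrid pattern, assuming two data frames that each carry a unique `rec_id` and a partially populated `national_id` column; `probabilistic_link` is a hypothetical stand-in for a trained model such as Splink:

```python
import pandas as pd


def hybrid_link(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.DataFrame:
    """Deterministic pass on a strong identifier, probabilistic pass on the rest."""
    # 1. Deterministic pass: exact join where the strong identifier is present.
    exact = df_a.dropna(subset=["national_id"]).merge(
        df_b.dropna(subset=["national_id"]),
        on="national_id",
        suffixes=("_a", "_b"),
    )
    exact["match_probability"] = 1.0  # treat strong-identifier joins as settled

    # 2. Probabilistic pass on records the deterministic pass did not resolve.
    rest_a = df_a[~df_a["rec_id"].isin(exact["rec_id_a"])]
    rest_b = df_b[~df_b["rec_id"].isin(exact["rec_id_b"])]
    fuzzy = probabilistic_link(rest_a, rest_b)  # hypothetical model-based linker

    return pd.concat([exact, fuzzy], ignore_index=True)
```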
Splink offers an open-source implementation of probabilistic record linkage designed to scale to large datasets. It includes tools for blocking, comparison, model training, and result analysis, making it easier to implement probabilistic linkage in practice.
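As a rough sketch of what this looks like in practice, the snippet below follows the Splink 4 API as documented at the time of writing (comparison and training function names may differ between versions, so treat it as illustrative rather than definitive). It dedupes one of Splink's bundled synthetic datasets:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000  # small synthetic dataset bundled with Splink

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
    ],
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate model parameters: u from random sampling, m via expectation maximisation.
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

# Score candidate pairs and keep those above a chosen probability threshold.
predictions = linker.inference.predict(threshold_match_probability=0.9)
df_predictions = predictions.as_pandas_dataframe()
```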
Further reading
If you want to go deeper on record linkage:
Interactive articles
- Introduction to probabilistic linkage - Robin Linacre
- The mathematics of Fellegi-Sunter - Robin Linacre
Open-source tools
- Splink - Python, scalable probabilistic linkage
- fastLink - R implementation
- Dedupe - Python, machine learning approach
Footnotes
1. Blog post on bias in data linking, including naming inconsistencies and downstream impacts: https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html
2. Robin Linacre's interactive introduction to probabilistic record linkage: https://www.robinlinacre.com/intro_to_probabilistic_linkage/
3. Robin Linacre's explanation of the mathematics behind the Fellegi-Sunter model: https://www.robinlinacre.com/maths_of_fellegi_sunter/