Deterministic vs Probabilistic Record Linkage

This post introduces record linkage, compares deterministic and probabilistic approaches, and explains when each method makes sense.

Why record linkage matters

In an ideal world, every person, case, and organisation would share a universal identifier. Linking across data sources would be trivial, and you could answer questions that are otherwise out of reach.

Reality is messier. Most organisations have fragmented data spread across multiple systems, with no reliable, universal identifier. Source systems are designed primarily for operations, not analytics. Identifiers that would make linkage easy can be missing, inconsistently recorded, or unavailable for legal and practical reasons. Data protection constraints can further limit what is captured or retained.

This pattern is common across government, healthcare, finance, and research. Different systems capture overlapping populations but lack a shared key. Personal data can be incomplete, inconsistently recorded, and prone to change over time.

The absence of a reliable identifier constrains the services organisations can provide and the questions they can answer. Without linkage, you cannot track individuals across touchpoints, measure end-to-end journeys, or build a complete picture from fragmented sources.

At scale, record linkage is both a statistical and an engineering challenge. Some datasets run into the tens of millions of rows, and some linkage exercises involve hundreds of millions of records. A naive cartesian join (every record compared with every other) is infeasible. We need approaches that are accurate, transparent about uncertainty, and engineered to scale, including careful blocking strategies that reduce the search space to plausible candidate pairs.
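
To make blocking concrete, here is a minimal pandas sketch (the datasets, column names, and blocking key are invented for illustration). Instead of generating every possible pair, we only generate candidate pairs that agree on a blocking key such as year of birth:

```python
import pandas as pd

# Two small illustrative datasets with invented column names.
left = pd.DataFrame({
    "id": [1, 2, 3],
    "surname": ["Taylor", "Ahmed", "Nguyen"],
    "birth_year": [1985, 1990, 1985],
})
right = pd.DataFrame({
    "id": [101, 102, 103],
    "surname": ["Tayler", "Ahmed", "Nguyen"],  # note the "Tayler" typo
    "birth_year": [1985, 1990, 1985],
})

# Naive approach: the full cartesian product, len(left) * len(right) pairs.
all_pairs = left.merge(right, how="cross", suffixes=("_l", "_r"))

# Blocking: only compare records that agree on birth_year, which shrinks the
# candidate set to plausible pairs (and still lets "Taylor"/"Tayler" be compared).
candidate_pairs = left.merge(right, on="birth_year", suffixes=("_l", "_r"))

print(len(all_pairs), "pairs without blocking;", len(candidate_pairs), "with blocking")
```

The same idea carries over to datasets with millions of rows; only the blocking keys and the engine change.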

That leads to a simple question: what tools and methods let us link records reliably at scale, while being transparent about uncertainty and robust to messy real-world data?

Deterministic linkage: strengths and limits

Many linkage efforts start with deterministic linkage. In SQL terms, this is a join: if a set of fields matches exactly, we treat the records as referring to the same entity. Practitioners typically define combinations of identifiers they believe indicate a match, build up large join conditions from them, and exclude previously identified matches from subsequent joins.

For example, you might decide that if two entries match on first name, last name, date of birth, and address, they should be labelled as the same individual. Deterministic linkage can work well when data quality is high and the join keys are reliable. It produces a high-precision output that slots neatly into existing databases and pipelines, and it is computationally efficient (typically a collection of hash joins in modern database engines).
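
As a minimal sketch, here is that rule expressed as an exact join in pandas (the tables and values are made up for illustration):

```python
import pandas as pd

# Two hypothetical source systems capturing an overlapping population.
system_a = pd.DataFrame({
    "first_name": ["Sara", "John"],
    "last_name":  ["Khan", "Smith"],
    "dob":        ["1987-03-02", "1990-11-15"],
    "address":    ["12 High Street", "4 Oak Road"],
})
system_b = pd.DataFrame({
    "first_name": ["Sara", "Jon"],                 # "Jon": a one-character typo
    "last_name":  ["Khan", "Smith"],
    "dob":        ["1987-03-02", "1990-11-15"],
    "address":    ["12 High Street", "4 Oak Rd"],  # inconsistent formatting
})

# Deterministic rule: treat records as the same individual only if all four
# fields agree exactly. In SQL terms, an inner join on a composite key.
matches = system_a.merge(
    system_b,
    on=["first_name", "last_name", "dob", "address"],
    how="inner",
)

print(matches)  # Sara Khan links; John/Jon Smith does not
```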

But the brittleness shows as soon as data quality slips. Completeness issues, typographical errors, formatting inconsistencies, and changes over time can turn true matches into missed matches. In practice, deterministic methods often underperform on recall, and the shortfall is not evenly distributed across groups and contexts.

A concrete example is names. They may be recorded differently across systems, including variations in spelling, spacing, punctuation, ordering, and transliteration. Non-Anglicised names can be particularly vulnerable to inconsistent recording, where phonetic approximations or unfamiliarity lead to persistent discrepancies. If you rely on exact matching, these issues can propagate into biased linkage outcomes, with some individuals systematically less likely to be linked across systems.1

Probabilistic linkage: a different approach

Probabilistic linkage treats the problem differently. Instead of requiring exact agreement on a set of fields, it treats linkage as an evidence-based decision. You define a set of comparisons (for example, name similarity or date-of-birth agreement), and a model combines those signals to estimate how likely it is that two records refer to the same entity.

To see the difference, consider two records that clearly refer to the same person but contain minor discrepancies:

Deterministic linkage fails on records with typos and missing values: a single typo in the surname or a transposed digit in the date of birth causes the entire match to fail.

In this instance, even if you used a collection of deterministic rules (for example, searching for a match on name and date of birth, or name and address), you would still miss the match due to the discrepancies.

Probabilistic linkage, by contrast, weighs the evidence from each field. A model trained on your data learns how much evidence each type of agreement (or disagreement) provides. The waterfall chart below shows this in action—the same record pair, but now each comparison contributes to an overall match score:

Probabilistic linkage: each comparison adds or subtracts evidence. Despite imperfect data, the cumulative weight produces a 99% match probability.

This matters because it is more robust to imperfect data and gives you a structured way to express uncertainty. Rather than forcing every pair into “match” or “no match”, probabilistic linkage supports practical decision-making. You can separate very high-confidence links from borderline cases that may need review, and be explicit about where uncertainty remains.

The statistical foundations come from the Fellegi-Sunter framework, developed in the 1960s.23 The core idea is that each comparison (name, date of birth, address, etc.) provides evidence for or against a match. Fields that agree on rare values provide stronger evidence than fields that agree on common values. The model combines these signals into an overall match weight or probability.
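
To make the idea concrete, here is a small self-contained sketch of the Fellegi-Sunter calculation. The fields, the m and u probabilities, and the prior are all invented for illustration; in practice they are estimated from the data:

```python
import math

# Illustrative m/u probabilities (assumed values, not estimates from real data):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
field_params = {
    "first_name": {"m": 0.95, "u": 0.01},
    "surname":    {"m": 0.95, "u": 0.005},
    "dob":        {"m": 0.90, "u": 0.001},
    "postcode":   {"m": 0.80, "u": 0.0005},
}

# Observed comparison for one candidate pair: True means the fields agree.
observed = {"first_name": True, "surname": False, "dob": True, "postcode": True}

prior = 1 / 1000  # assumed prior probability that a random candidate pair is a match
total_weight = math.log2(prior / (1 - prior))  # start from the prior odds

# Each field contributes a log2 Bayes factor: log2(m / u) if it agrees,
# log2((1 - m) / (1 - u)) if it disagrees. Agreement on a field where chance
# agreement is rare (low u) carries the most weight.
for field, agrees in observed.items():
    m, u = field_params[field]["m"], field_params[field]["u"]
    weight = math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    total_weight += weight
    print(f"{field:<10} {'agree' if agrees else 'disagree':<8} weight {weight:+.2f}")

# Convert the accumulated log odds back into a match probability.
probability = 2**total_weight / (1 + 2**total_weight)
print(f"match probability: {probability:.4f}")
```

With these numbers the pair ends up with a very high match probability despite the surname disagreement, which is the same behaviour the waterfall example above illustrates.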

Choosing an approach

So when should you use each method?

Deterministic linkage works well when:

- Data quality is high and the join keys are reliable and consistently recorded
- Precision matters more than recall, and the cost of a missed match is low
- You need something simple and computationally efficient that slots into existing databases and pipelines

Probabilistic linkage is better suited when:

- There is no reliable shared identifier, and fields are incomplete, inconsistently recorded, or change over time
- Missed matches are costly, or would fall unevenly on groups whose data is recorded less consistently
- You need to express uncertainty, separating high-confidence links from borderline cases that may need review

In practice, many linkage pipelines use both. A deterministic pass can quickly resolve high-confidence matches on strong identifiers, while a probabilistic model handles the remainder. This hybrid approach balances efficiency with robustness.
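
As a rough sketch of that split, continuing the hypothetical system_a and system_b frames from the deterministic example above: an exact join resolves the easy matches, and an anti-join leaves the remainder for the probabilistic model.

```python
strong_keys = ["first_name", "last_name", "dob", "address"]

# Deterministic pass: cheap exact matches on strong identifiers.
exact_matches = system_a.merge(system_b, on=strong_keys, how="inner")

# Remainder: records from system_a with no exact match, to be scored by the
# probabilistic model (an anti-join via pandas' merge indicator).
remainder = (
    system_a.merge(system_b[strong_keys].drop_duplicates(),
                   on=strong_keys, how="left", indicator=True)
    .query("_merge == 'left_only'")
    .drop(columns="_merge")
)
```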

Splink offers an open-source implementation of probabilistic record linkage designed to scale to large datasets. It includes tools for blocking, comparison, model training, and result analysis, making it easier to implement probabilistic linkage in practice.
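
As a rough sketch of what that looks like in code, here is a minimal Splink 4-style configuration for deduplicating a single pandas DataFrame df. The column names, comparison choices, and blocking rules are illustrative, and the full training workflow in the Splink documentation includes further steps:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# df: a pandas DataFrame with a unique_id column plus first_name, surname,
# dob and postcode (column names chosen for this sketch).
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name"),  # fuzzy name comparison
        cl.JaroWinklerAtThresholds("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate u probabilities from random pairs, then m probabilities via
# expectation maximisation, then score the candidate pairs.
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("surname"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
predictions = linker.inference.predict(threshold_match_probability=0.9)
```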

Footnotes

  1. Blog post on bias in data linking, including naming inconsistencies and downstream impacts: https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html

  2. Robin Linacre’s interactive introduction to probabilistic record linkage: https://www.robinlinacre.com/intro_to_probabilistic_linkage/

  3. Robin Linacre’s explanation of the mathematics behind the Fellegi-Sunter model: https://www.robinlinacre.com/maths_of_fellegi_sunter/

