Blocking Topic Guides #1389

Merged
merged 27 commits into from Jul 17, 2023
27 commits
979e190
initial commit
RossKen Jul 3, 2023
3fb67f6
start feleshing out prediction blocking guide
RossKen Jul 3, 2023
c2edf55
Flesh out BR performance guide
RossKen Jul 4, 2023
1c06ddb
add choosing BRs guidance
RossKen Jul 4, 2023
2df607b
lint with black
RossKen Jul 4, 2023
47ff672
lint
RossKen Jul 4, 2023
3d0b8a9
Merge branch 'blocking_docs' of github.com:moj-analytical-services/sp…
RossKen Jul 4, 2023
ae411c0
lint with black
RossKen Jul 4, 2023
32cdc05
remove blocking predictions notebook
RossKen Jul 4, 2023
d39c700
improve topic guide folder structure
RossKen Jul 6, 2023
61c814c
fix relative paths
RossKen Jul 12, 2023
7f8b61a
tight -> strict
RossKen Jul 12, 2023
e8a4183
add extremes example
RossKen Jul 12, 2023
2afde3a
Merge branch 'master' into blocking_docs
RossKen Jul 12, 2023
3a59c49
Fix performance section
RossKen Jul 12, 2023
c8689b3
add options for "new" flag
RossKen Jul 12, 2023
c26fee9
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
f44e22a
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
2414a6d
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
db554bb
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
1d779d7
Update docs/topic_guides/blocking/blocking_rules.md
RossKen Jul 14, 2023
be7e80c
Update docs/topic_guides/blocking/blocking_rules.md
RossKen Jul 14, 2023
bd9f83d
Merge branch 'master' into blocking_docs
RossKen Jul 17, 2023
b166c50
additional wording
RossKen Jul 17, 2023
88bdd06
Merge branch 'master' into blocking_docs
RossKen Jul 17, 2023
c762a61
updates for brl
RossKen Jul 17, 2023
b4beaf1
lint
RossKen Jul 17, 2023
Binary file added docs/img/blocking/cumulative_comparisons.png
Binary file added docs/img/blocking/pairwise_comparisons.png
65 changes: 65 additions & 0 deletions docs/topic_guides/blocking/blocking_rules.md
@@ -0,0 +1,65 @@
---
tags:
- Blocking
- Performance
---

# The Challenges of Record Linkage

One of the main challenges to overcome in record linkage is the **scale** of the problem.

The number of pairs of records to compare grows according to the formula $\frac{n(n-1)}{2}$, i.e. with (approximately) the square of the number of records, as shown in the following chart:

![](../../img/blocking/pairwise_comparisons.png)

For example, a dataset of 1 million input records would generate around 500 billion pairwise record comparisons.

So, as datasets get bigger, the amount of computational resource required becomes extremely large (and costly). In practice, we reduce the amount of computation required using **blocking**.
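The quadratic growth is easy to verify directly (a minimal sketch; the `n_comparisons` helper is illustrative, not a Splink function):

```python
def n_comparisons(n: int) -> int:
    """Number of distinct pairwise comparisons for n records: n(n-1)/2."""
    return n * (n - 1) // 2

# 1 million records gives roughly 500 billion pairwise comparisons
for n in [1_000, 100_000, 1_000_000]:
    print(f"{n:>9,} records -> {n_comparisons(n):>15,} comparisons")
```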

## Blocking

Blocking is a technique for reducing the number of record pairs that are considered by a model.

Considering a dataset of 1 million records, comparing each record against all of the other records in the dataset generates ~500 billion pairwise comparisons. However, we know the vast majority of these record comparisons won't be matches, so processing the full ~500 billion comparisons would be largely pointless (as well as costly and time-consuming).

Instead, we can define a subset of potential comparisons using **Blocking Rules**. These are rules that define "blocks" of comparisons that should be considered. For example, the blocking rule:

`"l.first_name = r.first_name and l.surname = r.surname"`

will generate pairwise record comparisons only where both first name and surname match.

Within a Splink model, you can specify multiple "blocks" through multiple Blocking Rules to ensure all potential matches are considered.
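Conceptually, an equality blocking rule partitions the records by the blocking key and only generates comparisons within each partition. A toy sketch of the idea in plain Python (illustrative only; Splink implements blocking as a SQL join, and the records shown are invented):

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "first_name": "John", "surname": "Smith"},
    {"id": 2, "first_name": "John", "surname": "Smith"},
    {"id": 3, "first_name": "Jane", "surname": "Smith"},
    {"id": 4, "first_name": "John", "surname": "Jones"},
]

# Block on "l.first_name = r.first_name and l.surname = r.surname":
blocks = defaultdict(list)
for rec in records:
    blocks[(rec["first_name"], rec["surname"])].append(rec["id"])

# Only compare records that share a block
pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
print(pairs)  # [(1, 2)] - one comparison instead of all 6 possible pairs
```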

???+ note "Further Reading"

    For more information on blocking, please refer to [this article](https://toolkit.data.gov.au/data-integration/data-integration-projects/probabilistic-linking.html#key-steps-in-probabilistic-linking).

### Choosing Blocking Rules

The blocking process is a compromise between the amount of **computational resource** used when comparing records and **capturing all true matches**.

Even after blocking, the number of comparisons generated is usually much higher than the number of input records - often between 10 and 1,000 times higher. As a result, the performance of Splink is heavily influenced by the number of comparisons generated by the blocking rules, rather than the number of input records.

Getting the balance right between computational resource and capturing matches can be tricky, and is largely dependent on the specific datasets and use case of the linkage. In general, we recommend a strategy of starting with strict blocking rules, and gradually loosening them. Sticking to less than 10 million comparisons is a good place to start, before scaling jobs up to 100s of millions (:simple-duckdb: DuckDB on a laptop), or sometimes billions (:simple-apachespark: Spark or :simple-amazonaws: Athena).

Guidance for choosing Blocking Rules can be found in the two [Blocking in Splink](#blocking-in-splink) topic guides.

!!! note "Taking blocking to the extremes"
    If you have a large dataset to deduplicate, consider the implications of the two extremes of blocking:

    **Not enough blocking** (ensuring all matches are captured)
    There will be too many record pairs to consider, so the job will either take an extremely long time to run (hours/days) or grow so large that it crashes.

    **Too much blocking** (minimising computational resource)
    There won't be enough record pairs to consider, so the model won't perform well (or will struggle to be trained at all).


## Blocking in Splink

There are two areas in Splink where blocking is used:

- [Training a Splink model](./model_training.md)
- [Making Predictions from a Splink model](./predictions.md)

each of which is described in its own dedicated topic guide.

58 changes: 58 additions & 0 deletions docs/topic_guides/blocking/model_training.md
@@ -0,0 +1,58 @@
# Blocking for Model Training

Model Training Blocking Rules choose which record pairs from a dataset get considered when training a Splink model. These are used during Expectation Maximisation (EM), where we estimate the [m probability](./fellegi_sunter.md#m-probability) (in most cases).

The aim of Model Training Blocking Rules is to reduce the number of record pairs considered when training a Splink model, in order to reduce the computational resource required. Each Training Blocking Rule defines a training "block" of record pairs containing a combination of matches and non-matches, which are considered by Splink's Expectation Maximisation algorithm.

The Expectation Maximisation algorithm works best when the pairwise record comparisons contain anywhere between roughly 0.1% and 99.9% true matches. It works less efficiently if there is a huge imbalance between the two (e.g. a billion non-matches and only a hundred matches).

!!! note
    Unlike [Prediction Rules](./predictions.md), it does not matter if Training Rules exclude some true matches - they just need to generate examples of both matches and non-matches.


## Using Training Rules in Splink


Blocking Rules for Model Training are used as a parameter in the `estimate_parameters_using_expectation_maximisation` function. After a `linker` object has been instantiated, you can estimate `m probability` with training sessions such as:

```python
blocking_rule_for_training = "l.first_name = r.first_name"
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
```

Here, we have defined a "block" of records where `first_name` is the same. As names are not unique, we can be pretty sure that this "block" will contain a combination of matches and non-matches, which is what the EM algorithm requires.

Matching only on `first_name` will likely generate a large "block" of pairwise comparisons which will take longer to run. In this case it may be worthwhile applying a stricter blocking rule to reduce runtime. For example, a match on `first_name` and `surname`:

```python
blocking_rule_for_training = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
```

which will still have a combination of matches and non-matches, but fewer record pairs to consider.


## Choosing Training Rules

The idea behind Training Rules is to consider "blocks" of record pairs with a mixture of matches and non-matches. In practice, most blocking rules contain such a mixture, so the primary consideration should be reducing the runtime of model training by choosing Training Rules that reduce the number of record pairs in the training set.

There are some tools within Splink to help choose these rules. For example, the `count_num_comparisons_from_blocking_rule` function gives the number of record pairs generated by a blocking rule:

```py
linker.count_num_comparisons_from_blocking_rule(
    "l.first_name = r.first_name AND l.surname = r.surname"
)
```

It is recommended that you run this function to check how many comparisons are generated before training a model so that you do not needlessly run a training session on billions of comparisons.
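One way to follow this advice is to gate each training session behind a comparison-count check. A sketch (the 10 million budget is an arbitrary illustration, and the commented-out calls assume a `linker` instantiated as in the examples above):

```python
def check_training_budget(n_comparisons: int, budget: int = 10_000_000) -> bool:
    """Return True if an EM training session of this size is within budget."""
    return n_comparisons <= budget

# With a linker instantiated as in the examples above:
# rule = "l.first_name = r.first_name AND l.surname = r.surname"
# n = linker.count_num_comparisons_from_blocking_rule(rule)
# if check_training_budget(n):
#     linker.estimate_parameters_using_expectation_maximisation(rule)

print(check_training_budget(2_500_000))      # fine to train
print(check_training_budget(3_000_000_000))  # tighten the rule first
```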

!!! note
    Unlike [Prediction Rules](./predictions.md), Training Rules are treated separately for each EM training session. The total number of comparisons for Model Training is therefore simply the sum of `count_num_comparisons_from_blocking_rule` across all Blocking Rules (as opposed to the result of `cumulative_comparisons_from_blocking_rules_records`).
85 changes: 85 additions & 0 deletions docs/topic_guides/blocking/performance.md
@@ -0,0 +1,85 @@
# Blocking Rule Performance

When considering computational performance of blocking rules, there are two main drivers to address:

- How many pairwise comparisons are generated
- How long each pairwise comparison takes to run

Below we run through an example of how to address each of these drivers.

## Strict vs lenient Blocking Rules

One way to reduce the number of comparisons considered by a model is to apply strict blocking rules. However, this can have a significant impact on how well the Splink model works.

In practice, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster, which means you can iterate through model versions more quickly.

??? example "Example - Incrementally loosening Prediction Blocking Rules"

    When choosing Prediction Blocking Rules, consider how `blocking_rules_to_generate_predictions` may be made incrementally less strict. We may start with the following rule:

    `l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob`

    This is a very strict rule, and will only create comparisons where full name and date of birth match. This has the advantage of creating few record comparisons, but the disadvantage that the rule will miss true matches where there are typos or nulls in any of these three fields.

    This blocking rule could be loosened to:

    `substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth`

    Now it allows for typos or aliases in the first name, so long as the first letter is the same, as well as errors in the month or day of birth.

    Depending on the size of your input data, the rule could be further loosened to:

    `substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname`

    or even:

    `l.surname = r.surname`

    The `linker.count_num_comparisons_from_blocking_rule()` function can be used to select which rule is appropriate for your data.
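For simple equality rules, the intuition behind this count can be reproduced from value frequencies: a block containing $c$ records contributes $c(c-1)/2$ pairs. A toy sketch (illustrative only, not how Splink computes it):

```python
from collections import Counter

surnames = ["Smith", "Smith", "Smith", "Jones", "Jones", "Taylor"]

def pairs_from_equality_rule(values):
    """Pairs generated by blocking on equality of a single column (dedupe)."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())

# 3 "Smith" records give 3 pairs, 2 "Jones" give 1, 1 "Taylor" gives 0
print(pairs_from_equality_rule(surnames))  # 4
```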

## Efficient Blocking Rules

While the number of pairwise comparisons is important for reducing computation, it is also helpful to consider the efficiency of the Blocking Rules themselves. There are a number of ways to define subsets of records (i.e. "blocks"), but they are not all computationally efficient.

From a performance perspective, we consider two classes of blocking rule here:

- Equi-join conditions
- Filter conditions

### Equi-join Blocking Rules

Equi-joins are simply equality conditions between records, e.g.

`l.first_name = r.first_name`

These equality-based blocking rules are extremely efficient and can be executed quickly, even on very large datasets.

Equality-based blocking rules should be considered the default method for defining blocking rules and form the basis of the upcoming [Blocking Rules Library](https://github.com/moj-analytical-services/splink/pull/1370).


### Filter Blocking Rules

Filter conditions refer to any Blocking Rule that isn't a simple equality between columns. E.g.

`levenshtein(l.surname, r.surname) < 3`

Similarity-based blocking rules, such as the example above, are inefficient, as the `levenshtein` function needs to be evaluated for every possible record comparison before the pairs that do not satisfy the filter condition can be discarded.
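The cost difference can be illustrated by counting function evaluations: a filter condition must be evaluated on every one of the $n(n-1)/2$ candidate pairs. A toy sketch with a hand-rolled edit distance (illustrative only; in Splink the `levenshtein` function is provided by the SQL backend):

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (illustration only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

surnames = ["Smith", "Smyth", "Jones", "Taylor", "Tailor"]

# The filter must run on every candidate pair before any can be discarded
evaluations = 0
kept = []
for a, b in combinations(surnames, 2):
    evaluations += 1
    if levenshtein(a, b) < 3:
        kept.append((a, b))

print(evaluations)  # 10 evaluations ...
print(kept)         # ... to keep only [('Smith', 'Smyth'), ('Taylor', 'Tailor')]
```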


### Combining Blocking Rules Efficiently

Just as individual Blocking Rules impact performance, so does the way they are combined. The most efficient combinations are "AND" statements, e.g.

`l.first_name = r.first_name AND l.surname = r.surname`

"OR" statements are not as efficient and should be used sparingly. E.g.

`l.first_name = r.first_name OR l.surname = r.surname`
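Because Prediction Blocking Rules are supplied as a list and Splink deduplicates the comparisons they generate, an "OR" rule can usually be rewritten as several separate equi-join rules instead. A sketch of the two equivalent settings fragments (field names are illustrative):

```python
# One inefficient "OR" rule: cannot be executed as a single hash join.
settings_or = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name OR l.surname = r.surname",
    ],
}

# The same comparisons expressed as two equi-join rules: each runs as a
# fast hash join, and Splink removes the overlap between the two blocks.
settings_split = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
}
```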



??? note "Spark-specific Further Reading"

    Given the ability to parallelise operations in Spark, there are some additional configuration options which can improve the performance of blocking. Please refer to the Spark Performance Topic Guides for more information.

    Note: in Spark, equi-joins can also be referred to as **hashed** rules, as hashing the join keys facilitates splitting the workload across multiple machines.
89 changes: 89 additions & 0 deletions docs/topic_guides/blocking/predictions.md
@@ -0,0 +1,89 @@
# Blocking Rules for Splink Predictions

Prediction Blocking Rules choose which record pairs from a dataset are considered and scored by the Splink model.

The aims of Prediction Blocking Rules are to:

- Capture as many true matches as possible
- Reduce the total number of comparisons being generated


## Using Prediction Rules in Splink

Blocking Rules for Prediction are defined through `blocking_rules_to_generate_predictions` in the Settings dictionary of a model. For example:

``` py hl_lines="3-5"
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.surname = r.surname"
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email"),
    ],
}
```

will generate comparisons for all true matches where names match, but it would miss a true match where there was a typo in (say) the first name.

In general, it is usually impossible to find a single rule which both:

- Reduces the number of comparisons generated to a computationally tractable number

- Ensures comparisons are generated for all true matches

This is why `blocking_rules_to_generate_predictions` is a list. Suppose we also block on `postcode`:

```python
settings_example = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.surname = r.surname",
        "l.postcode = r.postcode",
    ]
}
```

We will now generate a pairwise comparison for the record where there was a typo in the first name, so long as there isn't also a difference in the postcode.

By specifying a variety of `blocking_rules_to_generate_predictions`, it becomes unlikely that a truly matching record would not be captured by at least one of the rules.

!!! note
    Unlike [Training Rules](./model_training.md), Prediction Rules are considered collectively, and are order-dependent. So, in the example above, the `l.postcode = r.postcode` blocking rule only generates record comparisons that match on `postcode` and were not already captured by the `first_name` and `surname` rule.
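The order-dependent deduplication described in this note can be sketched with toy data (illustrative only; Splink does this in SQL, and the records shown are invented):

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {"first_name": "John", "surname": "Smith", "postcode": "AB1 2CD"},
    2: {"first_name": "John", "surname": "Smith", "postcode": "AB1 2CD"},
    3: {"first_name": "Jon",  "surname": "Smith", "postcode": "AB1 2CD"},
}

def pairs_matching_on(keys):
    """Record-id pairs equal on all the given fields (toy equi-join)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[tuple(rec[k] for k in keys)].append(rid)
    return {pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)}

rule_1 = pairs_matching_on(["first_name", "surname"])  # {(1, 2)}
rule_2 = pairs_matching_on(["postcode"]) - rule_1      # only pairs rule 1 missed
print(rule_1, rule_2)  # {(1, 2)} {(1, 3), (2, 3)}
```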

## Choosing Prediction Rules

When defining blocking rules, it is important to consider the number of pairwise comparisons that will be generated by your blocking rules. There are a number of useful functions in Splink which can help with this.

Once a linker has been instantiated, we can use the `cumulative_num_comparisons_from_blocking_rules_chart` function to look at the cumulative number of comparisons generated by `blocking_rules_to_generate_predictions`. For example, a settings dictionary like this:

```py
settings = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
}
```

will generate something like the following when the chart function is run:

```py
linker = DuckDBLinker(df, settings)
linker.cumulative_num_comparisons_from_blocking_rules_chart()
```

![](../../img/blocking/cumulative_comparisons.png)

Here, similar to the note above, the `l.surname = r.surname` bar in light blue is a count of all record comparisons that match on `surname` and have not already been captured by the `first_name` rule.

You can also return the underlying data for this chart using the `cumulative_comparisons_from_blocking_rules_records` function:

```py
linker.cumulative_comparisons_from_blocking_rules_records()
```
> [{'row_count': 2253, 'rule': 'l.first_name = r.first_name'},
> {'row_count': 2568, 'rule': 'l.surname = r.surname'}]