Blocking Topic Guides #1389

Merged
merged 27 commits into from Jul 17, 2023
27 commits
979e190
initial commit
RossKen Jul 3, 2023
3fb67f6
start feleshing out prediction blocking guide
RossKen Jul 3, 2023
c2edf55
Flesh out BR performance guide
RossKen Jul 4, 2023
1c06ddb
add choosing BRs guidance
RossKen Jul 4, 2023
2df607b
lint with black
RossKen Jul 4, 2023
47ff672
lint
RossKen Jul 4, 2023
3d0b8a9
Merge branch 'blocking_docs' of github.com:moj-analytical-services/sp…
RossKen Jul 4, 2023
ae411c0
lint with black
RossKen Jul 4, 2023
32cdc05
remove blocking predictions notebook
RossKen Jul 4, 2023
d39c700
improve topic guide folder structure
RossKen Jul 6, 2023
61c814c
fix relative paths
RossKen Jul 12, 2023
7f8b61a
tight -> strict
RossKen Jul 12, 2023
e8a4183
add extremes example
RossKen Jul 12, 2023
2afde3a
Merge branch 'master' into blocking_docs
RossKen Jul 12, 2023
3a59c49
Fix performance section
RossKen Jul 12, 2023
c8689b3
add options for "new" flag
RossKen Jul 12, 2023
c26fee9
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
f44e22a
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
2414a6d
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
db554bb
Update docs/topic_guides/blocking/performance.md
RossKen Jul 14, 2023
1d779d7
Update docs/topic_guides/blocking/blocking_rules.md
RossKen Jul 14, 2023
be7e80c
Update docs/topic_guides/blocking/blocking_rules.md
RossKen Jul 14, 2023
bd9f83d
Merge branch 'master' into blocking_docs
RossKen Jul 17, 2023
b166c50
additional wording
RossKen Jul 17, 2023
88bdd06
Merge branch 'master' into blocking_docs
RossKen Jul 17, 2023
c762a61
updates for brl
RossKen Jul 17, 2023
b4beaf1
lint
RossKen Jul 17, 2023
Binary file added docs/img/blocking/cumulative_comparisons.png
Binary file added docs/img/blocking/pairwise_comparisons.png
65 changes: 65 additions & 0 deletions docs/topic_guides/blocking/blocking_rules.md
@@ -0,0 +1,65 @@
---
tags:
- Blocking
- Performance
---

# The Challenges of Record Linkage

One of the main challenges to overcome in record linkage is the **scale** of the problem.

The number of pairs of records to compare grows according to the formula $\frac{n(n-1)}{2}$, i.e. with (approximately) the square of the number of records, as shown in the following chart:

![](../../img/blocking/pairwise_comparisons.png)

For example, a dataset of 1 million input records would generate around 500 billion pairwise record comparisons.

So, as datasets get bigger, the amount of computational resource required becomes extremely large (and costly). In practice, we reduce the amount of computation required using **blocking**.
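The quadratic growth is easy to verify directly (a minimal sketch; the `n_comparisons` helper is illustrative, not a Splink function):

```python
def n_comparisons(n: int) -> int:
    """Number of distinct pairwise comparisons for n records: n(n-1)/2."""
    return n * (n - 1) // 2

# 1 million records gives roughly 500 billion pairwise comparisons
for n in [1_000, 100_000, 1_000_000]:
    print(f"{n:>9,} records -> {n_comparisons(n):>15,} comparisons")
```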

## Blocking

Blocking is a technique for reducing the number of record pairs that are considered by a model.

Considering a dataset of 1 million records, comparing each record against all of the other records in the dataset generates ~500 billion pairwise comparisons. However, we know the vast majority of these record comparisons won't be matches, so processing the full ~500 billion comparisons would be largely pointless (as well as costly and time-consuming).

Instead, we can define a subset of potential comparisons using **Blocking Rules**. These are rules that define "blocks" of comparisons that should be considered. For example, the blocking rule:

`"l.first_name = r.first_name and l.surname = r.surname"`

will generate pairwise record comparisons only where both first name and surname match.

Within a Splink model, you can specify multiple "blocks" through multiple Blocking Rules to ensure all potential matches are considered.
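Conceptually, an equality blocking rule partitions the records by the blocking key and only generates comparisons within each partition. A toy sketch of the idea in plain Python (illustrative only; Splink implements blocking as a SQL join, and the records shown are invented):

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "first_name": "John", "surname": "Smith"},
    {"id": 2, "first_name": "John", "surname": "Smith"},
    {"id": 3, "first_name": "Jane", "surname": "Smith"},
    {"id": 4, "first_name": "John", "surname": "Jones"},
]

# Block on "l.first_name = r.first_name and l.surname = r.surname":
blocks = defaultdict(list)
for rec in records:
    blocks[(rec["first_name"], rec["surname"])].append(rec["id"])

# Only compare records that share a block
pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
print(pairs)  # [(1, 2)] - one comparison instead of all 6 possible pairs
```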

???+ note "Further Reading"

    For more information on blocking, please refer to [this article](https://toolkit.data.gov.au/data-integration/data-integration-projects/probabilistic-linking.html#key-steps-in-probabilistic-linking).

### Choosing Blocking Rules

The blocking process is a compromise between the amount of **computational resource** used when comparing records and **capturing all true matches**.

Even after blocking, the number of comparisons generated is usually much higher than the number of input records - often between 10 and 1,000 times higher. As a result, the performance of Splink is heavily influenced by the number of comparisons generated by the blocking rules, rather than the number of input records.

Getting the balance right between computational resource and capturing matches can be tricky, and is largely dependent on the specific datasets and use case of the linkage. In general, we recommend a strategy of starting with strict blocking rules, and gradually loosening them. Sticking to less than 10 million comparisons is a good place to start, before scaling jobs up to 100s of millions (:simple-duckdb: DuckDB on a laptop), or sometimes billions (:simple-apachespark: Spark or :simple-amazonaws: Athena).

Guidance for choosing Blocking Rules can be found in the two [Blocking in Splink](#blocking-in-splink) topic guides.

!!! note "Taking blocking to the extremes"
    If you have a large dataset to deduplicate, consider the implications of the two extremes of blocking:

    **Not enough blocking** (ensuring all matches are captured)
    There will be too many record pairs to consider, so the job will either take an extremely long time to run (hours/days) or grow so large that it crashes.

    **Too much blocking** (minimising computational resource)
    There won't be enough record pairs to consider, so the model won't perform well (or will struggle to be trained at all).


## Blocking in Splink

There are two areas in Splink where blocking is used:

- [Training a Splink model](./model_training.md)
- [Making Predictions from a Splink model](./predictions.md)

each of which is described in its own dedicated topic guide.

58 changes: 58 additions & 0 deletions docs/topic_guides/blocking/model_training.md
@@ -0,0 +1,58 @@
# Blocking for Model Training

Model Training Blocking Rules choose which record pairs from a dataset get considered when training a Splink model. These are used during Expectation Maximisation (EM), where we estimate the [m probability](./fellegi_sunter.md#m-probability) (in most cases).

The aim of Model Training Blocking Rules is to reduce the number of record pairs considered when training a Splink model, in order to reduce the computational resource required. Each Training Blocking Rule defines a training "block" of record pairs containing a combination of matches and non-matches, which are considered by Splink's Expectation Maximisation algorithm.

The Expectation Maximisation algorithm works best when the pairwise record comparisons contain anywhere between roughly 0.1% and 99.9% true matches. It works less efficiently if there is a huge imbalance between the two (e.g. a billion non-matches and only a hundred matches).

!!! note
    Unlike [Prediction Rules](./predictions.md), it does not matter if Training Rules exclude some true matches - they just need to generate examples of both matches and non-matches.


## Using Training Rules in Splink


Blocking Rules for Model Training are used as a parameter in the `estimate_parameters_using_expectation_maximisation` function. After a `linker` object has been instantiated, you can estimate `m probability` with training sessions such as:

```python
blocking_rule_for_training = "l.first_name = r.first_name"
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
```

Here, we have defined a "block" of records where `first_name` is the same. As names are not unique, we can be pretty sure that this "block" will contain a combination of matches and non-matches, which is what the EM algorithm requires.

Matching only on `first_name` will likely generate a large "block" of pairwise comparisons which will take longer to run. In this case it may be worthwhile applying a stricter blocking rule to reduce runtime. For example, a match on `first_name` and `surname`:

```python
blocking_rule_for_training = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
```

which will still have a combination of matches and non-matches, but fewer record pairs to consider.


## Choosing Training Rules

The idea behind Training Rules is to consider "blocks" of record pairs with a mixture of matches and non-matches. In practice, most blocking rules contain such a mixture, so the primary consideration should be reducing the runtime of model training by choosing Training Rules that reduce the number of record pairs in the training set.

There are some tools within Splink to help choose these rules. For example, the `count_num_comparisons_from_blocking_rule` function gives the number of record pairs generated by a blocking rule:

```py
linker.count_num_comparisons_from_blocking_rule(
    "l.first_name = r.first_name AND l.surname = r.surname"
)
```

It is recommended that you run this function to check how many comparisons are generated before training a model so that you do not needlessly run a training session on billions of comparisons.
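One way to follow this advice is to gate each training session behind a comparison-count check. A sketch (the 10 million budget is an arbitrary illustration, and the commented-out calls assume a `linker` instantiated as in the examples above):

```python
def check_training_budget(n_comparisons: int, budget: int = 10_000_000) -> bool:
    """Return True if an EM training session of this size is within budget."""
    return n_comparisons <= budget

# With a linker instantiated as in the examples above:
# rule = "l.first_name = r.first_name AND l.surname = r.surname"
# n = linker.count_num_comparisons_from_blocking_rule(rule)
# if check_training_budget(n):
#     linker.estimate_parameters_using_expectation_maximisation(rule)

print(check_training_budget(2_500_000))      # fine to train
print(check_training_budget(3_000_000_000))  # tighten the rule first
```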

!!! note
    Unlike [Prediction Rules](./predictions.md), Training Rules are treated separately for each EM training session. The total number of comparisons for Model Training is therefore simply the sum of `count_num_comparisons_from_blocking_rule` across all Blocking Rules (as opposed to the result of `cumulative_comparisons_from_blocking_rules_records`).
85 changes: 85 additions & 0 deletions docs/topic_guides/blocking/performance.md
@@ -0,0 +1,85 @@
# Blocking Rule Performance

When considering computational performance of blocking rules, there are two main drivers to address:

- How many pairwise comparisons are generated
- How long each pairwise comparison takes to run

Below we run through an example of how to address each of these drivers.

## Strict vs lenient Blocking Rules

One way to reduce the number of comparisons considered by a model is to apply strict blocking rules. However, this can have a significant impact on how well the Splink model works.

In practice, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster, which means you can iterate through model versions more quickly.

??? example "Example - Incrementally loosening Prediction Blocking Rules"

    When choosing Prediction Blocking Rules, consider how `blocking_rules_to_generate_predictions` may be made incrementally less strict. We may start with the following rule:

    `l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob`

    This is a very strict rule, and will only create comparisons where full name and date of birth match. This has the advantage of creating few record comparisons, but the disadvantage that the rule will miss true matches where there are typos or nulls in any of these three fields.

    This blocking rule could be loosened to:

    `substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth`

    Now it allows for typos or aliases in the first name, so long as the first letter is the same, as well as errors in the month or day of birth.

    Depending on the size of your input data, the rule could be further loosened to:

    `substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname`

    or even:

    `l.surname = r.surname`

    The `linker.count_num_comparisons_from_blocking_rule()` function can be used to select which rule is appropriate for your data.
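For simple equality rules, the intuition behind this count can be reproduced from value frequencies: a block containing $c$ records contributes $c(c-1)/2$ pairs. A toy sketch (illustrative only, not how Splink computes it):

```python
from collections import Counter

surnames = ["Smith", "Smith", "Smith", "Jones", "Jones", "Taylor"]

def pairs_from_equality_rule(values):
    """Pairs generated by blocking on equality of a single column (dedupe)."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())

# 3 "Smith" records give 3 pairs, 2 "Jones" give 1, 1 "Taylor" gives 0
print(pairs_from_equality_rule(surnames))  # 4
```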

## Efficient Blocking Rules

While the number of pairwise comparisons is important for reducing computation, it is also helpful to consider the efficiency of the Blocking Rules themselves. There are a number of ways to define subsets of records (i.e. "blocks"), but they are not all computationally efficient.

From a performance perspective, we consider two classes of blocking rule here:

- Equi-join conditions
- Filter conditions

### Equi-join Blocking Rules

Equi-joins are simply equality conditions between records, e.g.

`l.first_name = r.first_name`

These equality-based blocking rules are extremely efficient and can be executed quickly, even on very large datasets.

Equality-based blocking rules should be considered the default method for defining blocking rules and form the basis of the upcoming [Blocking Rules Library](https://github.com/moj-analytical-services/splink/pull/1370).


### Filter Blocking Rules

Filter conditions refer to any Blocking Rule that isn't a simple equality between columns. E.g.

`levenshtein(l.surname, r.surname) < 3`

Similarity-based blocking rules, such as the example above, are inefficient, as the `levenshtein` function needs to be evaluated for every possible record comparison before the pairs that do not satisfy the filter condition can be discarded.
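The cost difference can be illustrated by counting function evaluations: a filter condition must be evaluated on every one of the $n(n-1)/2$ candidate pairs. A toy sketch with a hand-rolled edit distance (illustrative only; in Splink the `levenshtein` function is provided by the SQL backend):

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (illustration only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

surnames = ["Smith", "Smyth", "Jones", "Taylor", "Tailor"]

# The filter must run on every candidate pair before any can be discarded
evaluations = 0
kept = []
for a, b in combinations(surnames, 2):
    evaluations += 1
    if levenshtein(a, b) < 3:
        kept.append((a, b))

print(evaluations)  # 10 evaluations ...
print(kept)         # ... to keep only [('Smith', 'Smyth'), ('Taylor', 'Tailor')]
```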


### Combining Blocking Rules Efficiently

Just as individual Blocking Rules impact performance, so does the way they are combined. The most efficient combinations are "AND" statements, e.g.

`l.first_name = r.first_name AND l.surname = r.surname`

"OR" statements are not as efficient and should be used sparingly. E.g.

`l.first_name = r.first_name OR l.surname = r.surname`
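Because Prediction Blocking Rules are supplied as a list and Splink deduplicates the comparisons they generate, an "OR" rule can usually be rewritten as several separate equi-join rules instead. A sketch of the two equivalent settings fragments (field names are illustrative):

```python
# One inefficient "OR" rule: cannot be executed as a single hash join.
settings_or = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name OR l.surname = r.surname",
    ],
}

# The same comparisons expressed as two equi-join rules: each runs as a
# fast hash join, and Splink removes the overlap between the two blocks.
settings_split = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
}
```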



??? note "Spark-specific Further Reading"

    Given the ability to parallelise operations in Spark, there are some additional configuration options which can improve the performance of blocking. Please refer to the Spark Performance Topic Guides for more information.

    Note: in Spark, equi-joins can also be referred to as **hashed** rules, as hashing the join keys facilitates splitting the workload across multiple machines.
89 changes: 89 additions & 0 deletions docs/topic_guides/blocking/predictions.md
@@ -0,0 +1,89 @@
# Blocking Rules for Splink Predictions

Prediction Blocking Rules choose which record pairs from a dataset are considered and scored by the Splink model.

The aims of Prediction Blocking Rules are to:

- Capture as many true matches as possible
- Reduce the total number of comparisons being generated


## Using Prediction Rules in Splink

Blocking Rules for Prediction are defined through `blocking_rules_to_generate_predictions` in the Settings dictionary of a model. For example:

``` py hl_lines="3-5"
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.surname = r.surname"
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email"),
    ],
}
```

will generate comparisons for all true matches where names match, but it would miss a true match where there was a typo in (say) the first name.

In general, it is usually impossible to find a single rule which both:

- Reduces the number of comparisons generated to a computationally tractable number

- Ensures comparisons are generated for all true matches

This is why `blocking_rules_to_generate_predictions` is a list. Suppose we also block on `postcode`:

```python
settings_example = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.surname = r.surname",
        "l.postcode = r.postcode",
    ]
}
```

We will now generate a pairwise comparison for the record where there was a typo in the first name, so long as there isn't also a difference in the postcode.

By specifying a variety of `blocking_rules_to_generate_predictions`, it becomes unlikely that a truly matching record would not be captured by at least one of the rules.

!!! note
    Unlike [Training Rules](./model_training.md), Prediction Rules are considered collectively, and are order-dependent. So, in the example above, the `l.postcode = r.postcode` blocking rule only generates record comparisons that match on `postcode` and were not already captured by the `first_name` and `surname` rule.
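The order-dependent deduplication described in this note can be sketched with toy data (illustrative only; Splink does this in SQL, and the records shown are invented):

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {"first_name": "John", "surname": "Smith", "postcode": "AB1 2CD"},
    2: {"first_name": "John", "surname": "Smith", "postcode": "AB1 2CD"},
    3: {"first_name": "Jon",  "surname": "Smith", "postcode": "AB1 2CD"},
}

def pairs_matching_on(keys):
    """Record-id pairs equal on all the given fields (toy equi-join)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[tuple(rec[k] for k in keys)].append(rid)
    return {pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)}

rule_1 = pairs_matching_on(["first_name", "surname"])  # {(1, 2)}
rule_2 = pairs_matching_on(["postcode"]) - rule_1      # only pairs rule 1 missed
print(rule_1, rule_2)  # {(1, 2)} {(1, 3), (2, 3)}
```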

## Choosing Prediction Rules

When defining blocking rules, it is important to consider the number of pairwise comparisons that will be generated by your blocking rules. There are a number of useful functions in Splink which can help with this.

Once a linker has been instantiated, we can use the `cumulative_num_comparisons_from_blocking_rules_chart` function to look at the cumulative number of comparisons generated by `blocking_rules_to_generate_predictions`. For example, a settings dictionary like this:

```py
settings = {
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
}
```

will generate something like the following when the chart function is run:

```py
linker = DuckDBLinker(df, settings)
linker.cumulative_num_comparisons_from_blocking_rules_chart()
```

![](../../img/blocking/cumulative_comparisons.png)

Here, similar to the note above, the `l.surname = r.surname` bar in light blue is a count of all record comparisons that match on `surname` and have not already been captured by the `first_name` rule.

You can also return the underlying data for this chart using the `cumulative_comparisons_from_blocking_rules_records` function:

```py
linker.cumulative_comparisons_from_blocking_rules_records()
```
> [{'row_count': 2253, 'rule': 'l.first_name = r.first_name'},
> {'row_count': 2568, 'rule': 'l.surname = r.surname'}]