
Clusters topic guide #1883

Merged
merged 55 commits into from
Apr 4, 2024
Changes from 32 commits
a810626
Start of metrics topic guide
zslade Jan 24, 2024
978c3d9
Merge branch 'master' into clusters_topic_guide
zslade Jan 24, 2024
1445571
restructure intro
zslade Jan 30, 2024
3660259
update
zslade Feb 1, 2024
e705a1a
rearrange and fill in gaps
zslade Feb 1, 2024
4557558
updates
zslade Feb 2, 2024
5a18587
merge latest
zslade Feb 5, 2024
932528d
split out sections
zslade Feb 5, 2024
253f26a
fix sections
zslade Feb 5, 2024
232a14a
Update sections
zslade Feb 5, 2024
744226d
update overview/intro
zslade Feb 5, 2024
7bd6878
tweaking intro
zslade Feb 5, 2024
18851d3
tweaks
zslade Feb 6, 2024
20026b2
update density
zslade Feb 6, 2024
4a8bc93
update node degree
zslade Feb 6, 2024
5b70ed7
remove directed etc
zslade Feb 6, 2024
2bca2d7
tweak explanations
zslade Feb 6, 2024
7e4cc91
fleshing out how to guide
zslade Feb 6, 2024
3c46852
update how to and small tweaks
zslade Feb 6, 2024
c707179
Merge branch 'master' into clusters_topic_guide
zslade Feb 6, 2024
d3f9998
reorder
zslade Feb 12, 2024
21e2fd3
cluster centralisation
zslade Feb 12, 2024
c7d460e
small improvements
zslade Feb 12, 2024
bea498a
improvements
zslade Feb 12, 2024
e3ee66c
Merge branch 'master' into clusters_topic_guide
zslade Feb 12, 2024
5fea483
remove average and absolute
zslade Feb 12, 2024
e0e7495
improving centralisation explaination
zslade Feb 12, 2024
7f28ee7
update link
zslade Feb 12, 2024
05ea72d
small tweak
zslade Feb 17, 2024
f8e880c
remove graph definition
zslade Feb 17, 2024
f2d4ccb
Merge branch 'master' into clusters_topic_guide
zslade Feb 29, 2024
336c94a
Merge branch 'master' into clusters_topic_guide
zslade Mar 4, 2024
be5c9a6
Merge branch 'master' into clusters_topic_guide
RossKen Mar 28, 2024
6200739
minor edits
RossKen Mar 28, 2024
8ba2692
Merge branch 'master' into clusters_topic_guide
zslade Mar 28, 2024
016bb00
changes based off comments
zslade Mar 28, 2024
66ef851
Delete docs/comparison_level_library.md
zslade Mar 28, 2024
13e12b9
Delete docs/datasets.md
zslade Mar 28, 2024
f45d9bf
Delete docs/comparison_library.md
zslade Mar 28, 2024
38ab7b6
Delete docs/comparison_template_library.md
zslade Mar 28, 2024
7d694ca
Delete docs/comparison_level_composition.md
zslade Mar 28, 2024
e8cecfa
Merge branch 'master' into clusters_topic_guide
zslade Apr 2, 2024
9bfac5a
tweaks
zslade Apr 2, 2024
2e3f180
tweak
zslade Apr 2, 2024
858a16e
resolving comments and more tweaks
zslade Apr 2, 2024
22bf41d
update to notebook
zslade Apr 2, 2024
01b60b6
update and fix links
zslade Apr 2, 2024
39d0f34
spellcheck
zslade Apr 2, 2024
fd77b6e
add more graphic metric visuals
RossKen Apr 2, 2024
2032c8f
add cluster centralisation caveat
RossKen Apr 3, 2024
e7d08fe
Merge branch 'master' into clusters_topic_guide
RossKen Apr 3, 2024
d898a0a
add back useful density text
zslade Apr 3, 2024
c85ca71
re-add comparison libraries docs
RossKen Apr 4, 2024
100f2d2
add missing md doc
RossKen Apr 4, 2024
63496b3
fix clusters doc link
RossKen Apr 4, 2024
7 changes: 6 additions & 1 deletion docs/comparison_level_composition.md
@@ -13,7 +13,12 @@ For example, `or_(null_level("first_name"), null_level("surname"))` creates a ch

The Splink comparison level composition functions available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_composition_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[and_](#splink.comparison_level_composition.and_)|✓|✓|✓|✓|✓|
|[not_](#splink.comparison_level_composition.not_)|✓|✓|✓|✓|✓|
|[or_](#splink.comparison_level_composition.or_)|✓|✓|✓|✓|✓|




18 changes: 17 additions & 1 deletion docs/comparison_level_library.md
@@ -21,7 +21,23 @@ However, not every comparison level is available for every [Splink-compatible SQ

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_level_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[array_intersect_level](#splink.comparison_level_library.ArrayIntersectLevelBase)|✓|✓|✓||✓|
|[columns_reversed_level](#splink.comparison_level_library.ColumnsReversedLevelBase)|✓|✓|✓|✓|✓|
|[damerau_levenshtein_level](#splink.comparison_level_library.DamerauLevenshteinLevelBase)|✓|✓||✓||
|[datediff_level](#splink.comparison_level_library.DatediffLevelBase)|✓|✓|✓||✓|
|[distance_function_level](#splink.comparison_level_library.DistanceFunctionLevelBase)|✓|✓|✓|✓|✓|
|[distance_in_km_level](#splink.comparison_level_library.DistanceInKmLevelBase)|✓|✓|✓||✓|
|[else_level](#splink.comparison_level_library.ElseLevelBase)|✓|✓|✓|✓|✓|
|[exact_match_level](#splink.comparison_level_library.ExactMatchLevelBase)|✓|✓|✓|✓|✓|
|[jaccard_level](#splink.comparison_level_library.JaccardLevelBase)|✓|✓||||
|[jaro_level](#splink.comparison_level_library.JaroLevelBase)|✓|✓||✓||
|[jaro_winkler_level](#splink.comparison_level_library.JaroWinklerLevelBase)|✓|✓||✓||
|[levenshtein_level](#splink.comparison_level_library.LevenshteinLevelBase)|✓|✓|✓|✓|✓|
|[null_level](#splink.comparison_level_library.NullLevelBase)|✓|✓|✓|✓|✓|
|[percentage_difference_level](#splink.comparison_level_library.PercentageDifferenceLevelBase)|✓|✓|✓|✓|✓|




14 changes: 13 additions & 1 deletion docs/comparison_library.md
@@ -17,7 +17,19 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparisons available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[array_intersect_at_sizes](#splink.comparison_library.ArrayIntersectAtSizesBase)|✓|✓|✓||✓|
|[damerau_levenshtein_at_thresholds](#splink.comparison_library.DamerauLevenshteinAtThresholdsBase)|✓|✓||✓||
|[datediff_at_thresholds](#splink.comparison_library.DatediffAtThresholdsBase)|✓|✓|✓||✓|
|[distance_function_at_thresholds](#splink.comparison_library.DistanceFunctionAtThresholdsBase)|✓|✓|✓|✓|✓|
|[distance_in_km_at_thresholds](#splink.comparison_library.DistanceInKmAtThresholdsBase)|✓|✓|✓||✓|
|[exact_match](#splink.comparison_library.ExactMatchBase)|✓|✓|✓|✓|✓|
|[jaccard_at_thresholds](#splink.comparison_library.JaccardAtThresholdsBase)|✓|✓||||
|[jaro_at_thresholds](#splink.comparison_library.JaroAtThresholdsBase)|✓|✓||✓||
|[jaro_winkler_at_thresholds](#splink.comparison_library.JaroWinklerAtThresholdsBase)|✓|✓||✓||
|[levenshtein_at_thresholds](#splink.comparison_library.LevenshteinAtThresholdsBase)|✓|✓|✓|✓|✓|




9 changes: 8 additions & 1 deletion docs/comparison_template_library.md
@@ -13,7 +13,14 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_template_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[date_comparison](#splink.comparison_template_library.DateComparisonBase)|✓|✓||||
|[email_comparison](#splink.comparison_template_library.EmailComparisonBase)|✓|✓||||
|[forename_surname_comparison](#splink.comparison_template_library.ForenameSurnameComparisonBase)|✓|✓||✓||
|[name_comparison](#splink.comparison_template_library.NameComparisonBase)|✓|✓||✓||
|[postcode_comparison](#splink.comparison_template_library.PostcodeComparisonBase)|✓|✓|✓|||




16 changes: 14 additions & 2 deletions docs/datasets.md
@@ -48,7 +48,16 @@ which also contains information on available datasets, and which have already be

The datasets available are listed below:

{% include-markdown "./includes/generated_files/datasets_table.md" %}
|dataset name|description|rows|unique entities|link to source|
|-|-|-|-|-|
|`fake_1000`|Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled.|1,000|250|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/fake_1000.csv)|
|`historical_50k`|The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors.|50,000|5,156|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/historical_figures_with_errors_50k.parquet)|
|`febrl3`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. The FEBRL3 dataset contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record.|5,000|2,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset3.csv)|
|`febrl4a`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records.|5,000|5,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset4a.csv)|
|`febrl4b`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a.|5,000|5,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset4b.csv)|
|`transactions_origin`|This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing.|45,326|45,326|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/transactions_origin.parquet)|
|`transactions_destination`|This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing.|45,326|45,326|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/transactions_destination.parquet)|



## `splink_dataset_labels`
@@ -59,7 +68,10 @@ Some of the `splink_datasets` have corresponding clerical labels to help assess

The datasets available are listed below:

{% include-markdown "./includes/generated_files/dataset_labels_table.md" %}
|dataset name|description|rows|unique entities|link to source|
|-|-|-|-|-|
|`fake_1000_labels`|Clerical labels for fake_1000 |3,176|NA|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/fake_1000_labels.csv)|



## `splink_dataset_utils` API
3 changes: 0 additions & 3 deletions docs/topic_guides/evaluation/clusters.md

This file was deleted.

91 changes: 91 additions & 0 deletions docs/topic_guides/evaluation/clusters/graph_metrics.md
@@ -0,0 +1,91 @@
# Graph metrics

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is [cluster size](), which is the number of nodes within a cluster.

For data linking with Splink, it is useful to sort graph metrics into three categories:

* [Node metrics]()
* [Edge metrics]()
* [Cluster metrics]()

Each of these is defined below, together with examples and explanations of how they can be applied to linked data to evaluate cluster quality. The examples given cover all metrics currently available in Splink.

!!! note

It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. A more comprehensive picture can be built by considering various metrics in conjunction with one another.

It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.


## ⚫️ Node metrics

Node metrics quantify the properties of the nodes which live within clusters.

### Example: node degree

Node degree is the number of edges connected to a node.

High node degree is generally considered good as it means there are many edges in support of records in a cluster being linked. Nodes with low node degree could indicate links being missed (false negatives).

However, erroneous links (false positives) could also be the reason for high node degree, so it can be useful to validate the edges of highly connected nodes.

It is important to consider [cluster size]() when looking at node degree. By definition, larger clusters contain more nodes to form links between, allowing nodes within them to attain higher degrees compared to those in smaller clusters. Consequently, low node degree within larger clusters can carry greater significance.

Bear in mind that the degree of a single node isn't necessarily representative of the overall connectedness of a cluster. This is where [cluster centralisation]() can help.
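To make the definition concrete, here is a minimal pure-Python sketch of computing node degree from a cluster's edge list. This is illustrative only (the `node_degrees` helper and the toy record ids are hypothetical, not part of Splink's API):

```python
from collections import Counter

def node_degrees(edges):
    """Count the number of edges attached to each node.

    `edges` is a list of (node_a, node_b) pairs representing the
    pairwise links within a cluster.
    """
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)

# A toy cluster: "r1" is linked to every other record, "r4" only once
edges = [("r1", "r2"), ("r1", "r3"), ("r1", "r4"), ("r2", "r3")]
print(node_degrees(edges))  # {'r1': 3, 'r2': 2, 'r3': 2, 'r4': 1}
```

Here the low degree of `r4` would prompt a closer look at whether its single supporting link is genuine.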

## 🔗 Edge metrics

Edge metrics quantify the properties of the edges within a cluster.

### Example: 'is bridge'

An edge is classified as a 'bridge' if its removal splits a cluster into two smaller clusters.

[insert picture]

Bridges can signal false positives in linked data, especially when they join two highly connected sub-clusters. Examining bridges can shed light on errors in the linking process that lead to the formation of false positive links.
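A simple way to see what "bridge" means in practice is a brute-force check: remove the edge and test whether its endpoints are still connected. The sketch below is illustrative (Splink itself uses `igraph` for this; the helper name and toy data are hypothetical):

```python
from collections import defaultdict

def is_bridge(edges, edge):
    """Return True if removing `edge` disconnects the cluster.

    Drops the edge, then searches from one endpoint to see whether
    the other endpoint is still reachable.
    """
    remaining = [e for e in edges if set(e) != set(edge)]
    adjacency = defaultdict(set)
    for a, b in remaining:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start, target = edge
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == target:
            return False  # still reachable: not a bridge
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return True

# Two triangles joined by the single edge ("c", "d") -- that edge is a bridge
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
print(is_bridge(edges, ("c", "d")))  # True
print(is_bridge(edges, ("a", "b")))  # False
```

In the toy cluster above, the bridge joins two otherwise dense sub-clusters, which is exactly the pattern worth reviewing for a false positive link.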

## :fontawesome-solid-circle-nodes: Cluster metrics

Cluster metrics refer to the characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

### Example: cluster size

Cluster size refers to the number of nodes within a cluster.

When thinking about cluster size, it is important to consider the size of the biggest clusters produced and ask yourself whether this seems reasonable for the dataset being linked. For example, does it make sense for one person to appear hundreds of times in the linked data, resulting in a cluster of over 100 nodes? If not, false positive links are probably being formed. This could be due, for example, to blocking rules which are too loose.

If you don't have an intuition of what seems reasonable, then it is worth inspecting a sample of the largest clusters in Splink's [Cluster Studio Dashboard]() to validate (or invalidate) links. From there you can develop an understanding of what maximum cluster size to expect.

There might also be a lower bound on cluster size. For example, when linking two datasets in which you know each person appears at least once, the minimum expected cluster size will be 2. Clusters smaller than this minimum indicate that links have been missed. This could be due to blocking rules not letting through all record comparisons of true matches.
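Cluster sizes are cheap to tabulate once you have a record-to-cluster mapping. A minimal sketch with hypothetical data (not Splink output; record and cluster ids are made up):

```python
from collections import Counter

# Hypothetical clustering output: each record id mapped to its cluster id
cluster_of = {
    "rec1": "A", "rec2": "A", "rec3": "A",
    "rec4": "B", "rec5": "B",
    "rec6": "C",
}

cluster_sizes = Counter(cluster_of.values())
print(cluster_sizes.most_common())  # [('A', 3), ('B', 2), ('C', 1)]

# Flag clusters outside plausible bounds, e.g. a known minimum size of 2
too_small = [c for c, size in cluster_sizes.items() if size < 2]
print(too_small)                    # ['C']
```

The same pattern works for the upper bound: sort by size descending and inspect the largest clusters first.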

### Example: cluster density

The density of a cluster is given by the number of edges it contains divided by the maximum possible number of edges. Density ranges from 0 to 1. A density of 1 means that all nodes are connected to all other nodes in a cluster.

[picture: edges vs max possible edges]

When evaluating clusters, a high density (closer to 1) is generally considered good as it means there are many edges in support of the records in a cluster being linked.

A low density could indicate links being missed. This could happen, for example, if blocking rules are too tight or the clustering threshold is too high.

A sample of low-density clusters can be inspected in Splink's [Cluster Studio Dashboard]() via the option `sampling_method = "lowest_density_clusters_by_size"`. When inspecting a cluster, ask yourself: why aren't more links being formed between record nodes?

Bear in mind that small clusters are more likely to achieve a higher density, as fewer record comparisons are required to form the maximum possible number of edges (a cluster of size 3 can achieve the maximum density of 1 with only 3 pairwise record comparisons).

Therefore it's important to consider a range of sizes when evaluating density to ensure you're not just focussed on very big clusters. Smaller clusters also have the advantage of being easier to assess by eye. This is why the option `sampling_method = "lowest_density_clusters_by_size"` performs stratified sampling across different cluster sizes.
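The density calculation itself is simple. A small illustrative sketch (the helper name is hypothetical, not Splink's API):

```python
def cluster_density(n_nodes, n_edges):
    """Edges present divided by the maximum possible number of edges.

    For an undirected cluster of n nodes, the maximum possible number
    of edges is n * (n - 1) / 2.
    """
    if n_nodes < 2:
        return 0.0  # a single node has no possible edges
    max_possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_possible_edges

print(cluster_density(3, 3))  # 1.0 -- fully connected triangle
print(cluster_density(5, 4))  # 0.4 -- e.g. a chain of 5 nodes: 4 of 10 possible edges
```

Note how quickly the denominator grows: a 20-node cluster has 190 possible edges, so even a well-linked large cluster will rarely approach a density of 1.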

<!-- With each increase in N, the number of possible edges increases. It might be 'harder' for bigger clusters to attain a higher density because blocking rules may prevent all record comparisons of nodes within a cluster. -->

### Example: cluster centralisation

[Cluster centralisation](https://en.wikipedia.org/wiki/Centrality#Degree_centrality) is defined as the total deviation of node degrees from the maximum [node degree](), normalised with respect to the maximum possible value. In other words, cluster centralisation tells us about the concentration of edges in a cluster. Centralisation ranges from 0 to 1.

A high cluster centralisation (closer to 1) indicates that a few nodes are home to significantly more connections compared to the rest of the nodes in a cluster. This can help identify clusters containing nodes with a lower number of connections (low node degree) relative to what is possible for that cluster.

Low centralisation suggests that edges are more evenly distributed amongst nodes in a cluster. This can be good if all nodes within a cluster enjoy many connections. However, low centralisation could also indicate that nodes are not as highly connected as they could be. To check for this, you can look at low centralisation in conjunction with low [node degree]() or [low density]().
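As a worked sketch, the standard Freeman degree-centralisation formula can be written as below. This is the common textbook normalisation, offered here for intuition; Splink's exact implementation may differ, and the helper name is hypothetical:

```python
def cluster_centralisation(degrees):
    """Freeman degree centralisation for an undirected cluster.

    Sums each node's deviation from the maximum degree, then divides
    by the largest value that sum can take, (n - 1) * (n - 2), which
    is attained by a star-shaped cluster.
    """
    n = len(degrees)
    if n < 3:
        return 0.0  # trivial for clusters of fewer than 3 nodes
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))

# Star of 4 nodes (one hub linked to three leaves): maximally centralised
print(cluster_centralisation([3, 1, 1, 1]))  # 1.0
# Fully connected 4-node cluster: edges evenly spread
print(cluster_centralisation([3, 3, 3, 3]))  # 0.0
```

The two extremes illustrate the interpretation above: a value near 1 means a few hub nodes carry most connections, while a value near 0 means degrees are evenly spread.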

[maybe include a picture to help aid understanding]

<br>

A guide on [how to compute all the graph metrics mentioned above with Splink]() is given in the next chapter.
42 changes: 42 additions & 0 deletions docs/topic_guides/evaluation/clusters/how_to_compute_metrics.md
@@ -0,0 +1,42 @@
# How to compute graph metrics with Splink

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the `compute_graph_metrics()` method.

The method is called on the `linker` like so:

```
graph_metrics = linker.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)
```
with the following arguments:

- `df_predict` (SplinkDataFrame): the results of `linker.predict()`
- `df_clustered` (SplinkDataFrame): the outputs of `linker.cluster_pairwise_predictions_at_threshold()`
- `threshold_match_probability` (float): filter the pairwise match predictions to include only pairwise comparisons with a `match_probability` at or above this threshold

!!! warning

`threshold_match_probability` should be the same as the clustering threshold passed to `cluster_pairwise_predictions_at_threshold()`. If this information is available to Splink then it will be passed automatically, otherwise the user will have to provide it themselves and take care to ensure that threshold values align.

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. Assuming the return value has been assigned to a variable `graph_metrics`, the individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

```
graph_metrics.nodes     # node metrics
graph_metrics.edges     # edge metrics
graph_metrics.clusters  # cluster metrics
```

The metrics computed by `compute_graph_metrics()` include all those mentioned in the [Graph metrics]() chapter, namely:

* Node degree
* 'Is bridge'
* Cluster size
* Cluster density
* Cluster centralisation

All of these metrics are calculated by default. If you are unable to install the `igraph` package required for 'is bridge', this metric won't be calculated; however, all other metrics will still be generated.

This topic guide is a work in progress. Please check back for more detailed examples of how `compute_graph_metrics()` can be used to evaluate linked data.