
Clusters topic guide #1883

Merged
merged 55 commits into from
Apr 4, 2024
Changes from 32 commits
a810626
Start of metrics topic guide
zslade Jan 24, 2024
978c3d9
Merge branch 'master' into clusters_topic_guide
zslade Jan 24, 2024
1445571
restructure intro
zslade Jan 30, 2024
3660259
update
zslade Feb 1, 2024
e705a1a
rearrange and fill in gaps
zslade Feb 1, 2024
4557558
updates
zslade Feb 2, 2024
5a18587
merge latest
zslade Feb 5, 2024
932528d
split out sections
zslade Feb 5, 2024
253f26a
fix sections
zslade Feb 5, 2024
232a14a
Update sections
zslade Feb 5, 2024
744226d
update overview/intro
zslade Feb 5, 2024
7bd6878
tweaking intro
zslade Feb 5, 2024
18851d3
tweaks
zslade Feb 6, 2024
20026b2
update density
zslade Feb 6, 2024
4a8bc93
update node degree
zslade Feb 6, 2024
5b70ed7
remove directed etc
zslade Feb 6, 2024
2bca2d7
tweak explanations
zslade Feb 6, 2024
7e4cc91
fleshing out how to guide
zslade Feb 6, 2024
3c46852
update how to and small tweaks
zslade Feb 6, 2024
c707179
Merge branch 'master' into clusters_topic_guide
zslade Feb 6, 2024
d3f9998
reorder
zslade Feb 12, 2024
21e2fd3
cluster centralisation
zslade Feb 12, 2024
c7d460e
small improvements
zslade Feb 12, 2024
bea498a
improvements
zslade Feb 12, 2024
e3ee66c
Merge branch 'master' into clusters_topic_guide
zslade Feb 12, 2024
5fea483
remove average and absolute
zslade Feb 12, 2024
e0e7495
improving centralisation explaination
zslade Feb 12, 2024
7f28ee7
update link
zslade Feb 12, 2024
05ea72d
small tweak
zslade Feb 17, 2024
f8e880c
remove graph definition
zslade Feb 17, 2024
f2d4ccb
Merge branch 'master' into clusters_topic_guide
zslade Feb 29, 2024
336c94a
Merge branch 'master' into clusters_topic_guide
zslade Mar 4, 2024
be5c9a6
Merge branch 'master' into clusters_topic_guide
RossKen Mar 28, 2024
6200739
minor edits
RossKen Mar 28, 2024
8ba2692
Merge branch 'master' into clusters_topic_guide
zslade Mar 28, 2024
016bb00
changes based off comments
zslade Mar 28, 2024
66ef851
Delete docs/comparison_level_library.md
zslade Mar 28, 2024
13e12b9
Delete docs/datasets.md
zslade Mar 28, 2024
f45d9bf
Delete docs/comparison_library.md
zslade Mar 28, 2024
38ab7b6
Delete docs/comparison_template_library.md
zslade Mar 28, 2024
7d694ca
Delete docs/comparison_level_composition.md
zslade Mar 28, 2024
e8cecfa
Merge branch 'master' into clusters_topic_guide
zslade Apr 2, 2024
9bfac5a
tweaks
zslade Apr 2, 2024
2e3f180
tweak
zslade Apr 2, 2024
858a16e
resolving comments and more tweaks
zslade Apr 2, 2024
22bf41d
update to notebook
zslade Apr 2, 2024
01b60b6
update and fix links
zslade Apr 2, 2024
39d0f34
spellcheck
zslade Apr 2, 2024
fd77b6e
add more graphic metric visuals
RossKen Apr 2, 2024
2032c8f
add cluster centralisation caveat
RossKen Apr 3, 2024
e7d08fe
Merge branch 'master' into clusters_topic_guide
RossKen Apr 3, 2024
d898a0a
add back useful density text
zslade Apr 3, 2024
c85ca71
re-add comparison libraries docs
RossKen Apr 4, 2024
100f2d2
add missing md doc
RossKen Apr 4, 2024
63496b3
fix clusters doc link
RossKen Apr 4, 2024
7 changes: 6 additions & 1 deletion docs/comparison_level_composition.md
@@ -13,7 +13,12 @@ For example, `or_(null_level("first_name"), null_level("surname"))` creates a ch

The Splink comparison level composition functions available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_composition_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[and_](#splink.comparison_level_composition.and_)|✓|✓|✓|✓|✓|
|[not_](#splink.comparison_level_composition.not_)|✓|✓|✓|✓|✓|
|[or_](#splink.comparison_level_composition.or_)|✓|✓|✓|✓|✓|




18 changes: 17 additions & 1 deletion docs/comparison_level_library.md
@@ -21,7 +21,23 @@ However, not every comparison level is available for every [Splink-compatible SQ

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_level_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[array_intersect_level](#splink.comparison_level_library.ArrayIntersectLevelBase)|✓|✓|✓||✓|
|[columns_reversed_level](#splink.comparison_level_library.ColumnsReversedLevelBase)|✓|✓|✓|✓|✓|
|[damerau_levenshtein_level](#splink.comparison_level_library.DamerauLevenshteinLevelBase)|✓|✓||✓||
|[datediff_level](#splink.comparison_level_library.DatediffLevelBase)|✓|✓|✓||✓|
|[distance_function_level](#splink.comparison_level_library.DistanceFunctionLevelBase)|✓|✓|✓|✓|✓|
|[distance_in_km_level](#splink.comparison_level_library.DistanceInKmLevelBase)|✓|✓|✓||✓|
|[else_level](#splink.comparison_level_library.ElseLevelBase)|✓|✓|✓|✓|✓|
|[exact_match_level](#splink.comparison_level_library.ExactMatchLevelBase)|✓|✓|✓|✓|✓|
|[jaccard_level](#splink.comparison_level_library.JaccardLevelBase)|✓|✓||||
|[jaro_level](#splink.comparison_level_library.JaroLevelBase)|✓|✓||✓||
|[jaro_winkler_level](#splink.comparison_level_library.JaroWinklerLevelBase)|✓|✓||✓||
|[levenshtein_level](#splink.comparison_level_library.LevenshteinLevelBase)|✓|✓|✓|✓|✓|
|[null_level](#splink.comparison_level_library.NullLevelBase)|✓|✓|✓|✓|✓|
|[percentage_difference_level](#splink.comparison_level_library.PercentageDifferenceLevelBase)|✓|✓|✓|✓|✓|




14 changes: 13 additions & 1 deletion docs/comparison_library.md
@@ -17,7 +17,19 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparisons available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[array_intersect_at_sizes](#splink.comparison_library.ArrayIntersectAtSizesBase)|✓|✓|✓||✓|
|[damerau_levenshtein_at_thresholds](#splink.comparison_library.DamerauLevenshteinAtThresholdsBase)|✓|✓||✓||
|[datediff_at_thresholds](#splink.comparison_library.DatediffAtThresholdsBase)|✓|✓|✓||✓|
|[distance_function_at_thresholds](#splink.comparison_library.DistanceFunctionAtThresholdsBase)|✓|✓|✓|✓|✓|
|[distance_in_km_at_thresholds](#splink.comparison_library.DistanceInKmAtThresholdsBase)|✓|✓|✓||✓|
|[exact_match](#splink.comparison_library.ExactMatchBase)|✓|✓|✓|✓|✓|
|[jaccard_at_thresholds](#splink.comparison_library.JaccardAtThresholdsBase)|✓|✓||||
|[jaro_at_thresholds](#splink.comparison_library.JaroAtThresholdsBase)|✓|✓||✓||
|[jaro_winkler_at_thresholds](#splink.comparison_library.JaroWinklerAtThresholdsBase)|✓|✓||✓||
|[levenshtein_at_thresholds](#splink.comparison_library.LevenshteinAtThresholdsBase)|✓|✓|✓|✓|✓|




9 changes: 8 additions & 1 deletion docs/comparison_template_library.md
@@ -13,7 +13,14 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_template_library_dialect_table.md" %}
||:simple-duckdb: <br> DuckDB|:simple-apachespark: <br> Spark|:simple-amazonaws: <br> Athena|:simple-sqlite: <br> SQLite|:simple-postgresql: <br> PostgreSql|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[date_comparison](#splink.comparison_template_library.DateComparisonBase)|✓|✓||||
|[email_comparison](#splink.comparison_template_library.EmailComparisonBase)|✓|✓||||
|[forename_surname_comparison](#splink.comparison_template_library.ForenameSurnameComparisonBase)|✓|✓||✓||
|[name_comparison](#splink.comparison_template_library.NameComparisonBase)|✓|✓||✓||
|[postcode_comparison](#splink.comparison_template_library.PostcodeComparisonBase)|✓|✓|✓|||




16 changes: 14 additions & 2 deletions docs/datasets.md
@@ -48,7 +48,16 @@ which also contains information on available datasets, and which have already be

The datasets available are listed below:

{% include-markdown "./includes/generated_files/datasets_table.md" %}
|dataset name|description|rows|unique entities|link to source|
|-|-|-|-|-|
|`fake_1000`|Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled.|1,000|250|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/fake_1000.csv)|
|`historical_50k`|The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors.|50,000|5,156|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/historical_figures_with_errors_50k.parquet)|
|`febrl3`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. The FEBRL3 dataset contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record.|5,000|2,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset3.csv)|
|`febrl4a`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records.|5,000|5,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset4a.csv)|
|`febrl4b`|The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a.|5,000|5,000|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/febrl/dataset4b.csv)|
|`transactions_origin`|This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing.|45,326|45,326|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/transactions_origin.parquet)|
|`transactions_destination`|This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing.|45,326|45,326|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/transactions_destination.parquet)|



## `splink_dataset_labels`
@@ -59,7 +68,10 @@ Some of the `splink_datasets` have corresponding clerical labels to help assess

The datasets available are listed below:

{% include-markdown "./includes/generated_files/dataset_labels_table.md" %}
|dataset name|description|rows|unique entities|link to source|
|-|-|-|-|-|
|`fake_1000_labels`|Clerical labels for fake_1000 |3,176|NA|[source](https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/fake_1000_labels.csv)|



## `splink_dataset_utils` API
3 changes: 0 additions & 3 deletions docs/topic_guides/evaluation/clusters.md

This file was deleted.

91 changes: 91 additions & 0 deletions docs/topic_guides/evaluation/clusters/graph_metrics.md
@@ -0,0 +1,91 @@
# Graph metrics

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is [cluster size](), which is the number of nodes within a cluster.

For data linking with Splink, it is useful to sort graph metrics into three categories:

* [Node metrics]()
* [Edge metrics]()
* [Cluster metrics]()

Each of these is defined below, together with examples and explanations of how they can be applied to linked data to evaluate cluster quality. The examples given cover all metrics currently available in Splink.

!!! note

It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. A more comprehensive picture can be built by considering various metrics in conjunction with one another.

It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.


## ⚫️ Node metrics

Node metrics quantify the properties of the nodes which live within clusters.

### Example: node degree

Node degree is the number of edges connected to a node.

High node degree is generally considered good as it means there are many edges in support of records in a cluster being linked. Nodes with low node degree could indicate links being missed (false negatives).

However, erroneous links (false positives) could also be the reason for high node degree, so it can be useful to validate the edges of highly connected nodes.

It is important to consider [cluster size]() when looking at node degree. By definition, larger clusters contain more nodes to form links between, allowing nodes within them to attain higher degrees compared to those in smaller clusters. Consequently, low node degree within larger clusters can carry greater significance.

Bear in mind that the degree of a single node isn't necessarily representative of the overall connectedness of a cluster. This is where [cluster centralisation]() can help.
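To make the definition concrete, here is a minimal pure-Python sketch of computing node degree from a cluster's edge list. This is illustrative only (the `node_degrees` helper and the toy record ids are hypothetical, not part of Splink's API):

```python
from collections import Counter

def node_degrees(edges):
    """Count the number of edges attached to each node.

    `edges` is a list of (node_a, node_b) pairs representing the
    pairwise links within a cluster.
    """
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)

# A toy cluster: "r1" is linked to every other record, "r4" only once
edges = [("r1", "r2"), ("r1", "r3"), ("r1", "r4"), ("r2", "r3")]
print(node_degrees(edges))  # {'r1': 3, 'r2': 2, 'r3': 2, 'r4': 1}
```

Here the low degree of `r4` would prompt a closer look at whether its single supporting link is genuine.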

## 🔗 Edge metrics

Edge metrics quantify the properties of the edges within a cluster.

### Example: 'is bridge'

An edge is classified as a 'bridge' if its removal splits a cluster into two smaller clusters.

[insert picture]

Bridges can signal false positives in linked data, especially when they join two highly connected sub-clusters. Examining bridges can shed light on errors in the linking process that lead to the formation of false positive links.
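A simple way to see what "bridge" means in practice is a brute-force check: remove the edge and test whether its endpoints are still connected. The sketch below is illustrative (Splink itself uses `igraph` for this; the helper name and toy data are hypothetical):

```python
from collections import defaultdict

def is_bridge(edges, edge):
    """Return True if removing `edge` disconnects the cluster.

    Drops the edge, then searches from one endpoint to see whether
    the other endpoint is still reachable.
    """
    remaining = [e for e in edges if set(e) != set(edge)]
    adjacency = defaultdict(set)
    for a, b in remaining:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start, target = edge
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == target:
            return False  # still reachable: not a bridge
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return True

# Two triangles joined by the single edge ("c", "d") -- that edge is a bridge
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
print(is_bridge(edges, ("c", "d")))  # True
print(is_bridge(edges, ("a", "b")))  # False
```

In the toy cluster above, the bridge joins two otherwise dense sub-clusters, which is exactly the pattern worth reviewing for a false positive link.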

## :fontawesome-solid-circle-nodes: Cluster metrics

Cluster metrics refer to the characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

### Example: cluster size

Cluster size refers to the number of nodes within a cluster.

When thinking about cluster size, it is important to consider the size of the biggest clusters produced and ask yourself whether this seems reasonable for the dataset being linked. For example, does it make sense for one person to appear hundreds of times in the linked data, resulting in a cluster of over 100 nodes? If not, false positive links are probably being formed. This could be due, for example, to blocking rules which are too loose.

If you don't have an intuition of what seems reasonable, then it is worth inspecting a sample of the largest clusters in Splink's [Cluster Studio Dashboard]() to validate (or invalidate) links. From there you can develop an understanding of what maximum cluster size to expect.

There might also be a lower bound on cluster size. For example, when linking two datasets in which you know each person appears at least once, the minimum expected cluster size will be 2. Clusters smaller than this minimum indicate that links have been missed. This could be due to blocking rules not letting through all record comparisons of true matches.
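Cluster sizes are cheap to tabulate once you have a record-to-cluster mapping. A minimal sketch with hypothetical data (not Splink output; record and cluster ids are made up):

```python
from collections import Counter

# Hypothetical clustering output: each record id mapped to its cluster id
cluster_of = {
    "rec1": "A", "rec2": "A", "rec3": "A",
    "rec4": "B", "rec5": "B",
    "rec6": "C",
}

cluster_sizes = Counter(cluster_of.values())
print(cluster_sizes.most_common())  # [('A', 3), ('B', 2), ('C', 1)]

# Flag clusters outside plausible bounds, e.g. a known minimum size of 2
too_small = [c for c, size in cluster_sizes.items() if size < 2]
print(too_small)                    # ['C']
```

The same pattern works for the upper bound: sort by size descending and inspect the largest clusters first.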

### Example: cluster density

The density of a cluster is given by the number of edges it contains divided by the maximum possible number of edges. Density ranges from 0 to 1. A density of 1 means that all nodes are connected to all other nodes in a cluster.

[picture: edges vs max possible edges]

When evaluating clusters, a high density (closer to 1) is generally considered good as it means there are many edges in support of the records in a cluster being linked.

A low density could indicate links being missed. This could happen, for example, if blocking rules are too tight or the clustering threshold is too high.

A sample of low-density clusters can be inspected in Splink's [Cluster Studio Dashboard]() via the option `sampling_method = "lowest_density_clusters_by_size"`. When inspecting a cluster, ask yourself: why aren't more links being formed between record nodes?

Bear in mind that small clusters are more likely to achieve a higher density, as fewer record comparisons are required to form the maximum possible number of edges (a cluster of size 3 can achieve the maximum density of 1 with only 3 pairwise record comparisons).

Therefore it's important to consider a range of sizes when evaluating density to ensure you're not just focussed on very big clusters. Smaller clusters also have the advantage of being easier to assess by eye. This is why the option `sampling_method = "lowest_density_clusters_by_size"` performs stratified sampling across different cluster sizes.
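The density calculation itself is simple. A small illustrative sketch (the helper name is hypothetical, not Splink's API):

```python
def cluster_density(n_nodes, n_edges):
    """Edges present divided by the maximum possible number of edges.

    For an undirected cluster of n nodes, the maximum possible number
    of edges is n * (n - 1) / 2.
    """
    if n_nodes < 2:
        return 0.0  # a single node has no possible edges
    max_possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_possible_edges

print(cluster_density(3, 3))  # 1.0 -- fully connected triangle
print(cluster_density(5, 4))  # 0.4 -- e.g. a chain of 5 nodes: 4 of 10 possible edges
```

Note how quickly the denominator grows: a 20-node cluster has 190 possible edges, so even a well-linked large cluster will rarely approach a density of 1.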

<!-- With each increase in N, the number of possible edges increases. It might be 'harder' for bigger clusters to attain a higher density because blocking rules may prevent all record comparisons of nodes within a cluster. -->

### Example: cluster centralisation

[Cluster centralisation](https://en.wikipedia.org/wiki/Centrality#Degree_centrality) is defined as the total deviation of node degrees from the maximum [node degree](), normalised with respect to the maximum possible value. In other words, cluster centralisation tells us about the concentration of edges in a cluster. Centralisation ranges from 0 to 1.

A high cluster centralisation (closer to 1) indicates that a few nodes are home to significantly more connections compared to the rest of the nodes in a cluster. This can help identify clusters containing nodes with a lower number of connections (low node degree) relative to what is possible for that cluster.

Low centralisation suggests that edges are more evenly distributed amongst nodes in a cluster. This can be good if all nodes within a cluster enjoy many connections. However, low centralisation could also indicate that nodes are not as highly connected as they could be. To check for this, you can look at low centralisation in conjunction with low [node degree]() or [low density]().
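As a worked sketch, the standard Freeman degree-centralisation formula can be written as below. This is the common textbook normalisation, offered here for intuition; Splink's exact implementation may differ, and the helper name is hypothetical:

```python
def cluster_centralisation(degrees):
    """Freeman degree centralisation for an undirected cluster.

    Sums each node's deviation from the maximum degree, then divides
    by the largest value that sum can take, (n - 1) * (n - 2), which
    is attained by a star-shaped cluster.
    """
    n = len(degrees)
    if n < 3:
        return 0.0  # trivial for clusters of fewer than 3 nodes
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))

# Star of 4 nodes (one hub linked to three leaves): maximally centralised
print(cluster_centralisation([3, 1, 1, 1]))  # 1.0
# Fully connected 4-node cluster: edges evenly spread
print(cluster_centralisation([3, 3, 3, 3]))  # 0.0
```

The two extremes illustrate the interpretation above: a value near 1 means a few hub nodes carry most connections, while a value near 0 means degrees are evenly spread.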

[maybe include a picture to help aid understanding]

<br>

A guide on [how to compute all the graph metrics mentioned above with Splink]() is given in the next chapter.
42 changes: 42 additions & 0 deletions docs/topic_guides/evaluation/clusters/how_to_compute_metrics.md
@@ -0,0 +1,42 @@
# How to compute graph metrics with Splink

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the `compute_graph_metrics()` method.

The method is called on the `linker` like so:

```
graph_metrics = linker.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)
```
with the following arguments:

- `df_predict` (SplinkDataFrame): the results of `linker.predict()`
- `df_clustered` (SplinkDataFrame): the outputs of `linker.cluster_pairwise_predictions_at_threshold()`
- `threshold_match_probability` (float): filter the pairwise match predictions to include only pairwise comparisons with a `match_probability` at or above this threshold

!!! warning

`threshold_match_probability` should be the same as the clustering threshold passed to `cluster_pairwise_predictions_at_threshold()`. If this information is available to Splink then it will be passed automatically, otherwise the user will have to provide it themselves and take care to ensure that threshold values align.

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. Assuming the return value has been assigned to a variable `graph_metrics`, the individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

```
graph_metrics.nodes     # node metrics
graph_metrics.edges     # edge metrics
graph_metrics.clusters  # cluster metrics
```

The metrics computed by `compute_graph_metrics()` include all those mentioned in the [Graph metrics]() chapter, namely:

* Node degree
* 'Is bridge'
* Cluster size
* Cluster density
* Cluster centralisation

All of these metrics are calculated by default. If you are unable to install the `igraph` package required for 'is bridge', this metric won't be calculated; however, all other metrics will still be generated.

This topic guide is a work in progress. Please check back for more detailed examples of how `compute_graph_metrics()` can be used to evaluate linked data.