Merge f87b4a2 into fa7aea7

moj-analytical-services · Apr 12, 2023 · ffe2173 · ffe2173
2 parents fa7aea7 + f87b4a2
commit ffe2173

Showing 5 changed files with 719 additions and 354 deletions.
diff --git a/docs/topic_guides/comparison_templates.ipynb b/docs/topic_guides/comparison_templates.ipynb
@@ -0,0 +1,385 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Out-of-the-box Comparisons for specific data types\n",
+ "\n",
+ "Similarity is defined differently for different types of data (e.g. names, dates of birth, postcodes, addresses, IDs). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate ready-made comparisons for a variety of data types.\n",
+ "\n",
+ "Below are examples of how to structure comparisons for a variety of data types.\n",
+ "\n",
+ "<hr>"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Date Comparisons\n",
+ "\n",
+ "Date comparisons are generally structured as: \n",
+ "\n",
+ "- Null level \n",
+ "- Exact match \n",
+ "- Fuzzy match ([using metric of choice](comparators.md)) \n",
+ "- Interval match (within X days/months/years) \n",
+ "- Else level\n",
+ "\n",
+ "The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function, which provides this structure out-of-the-box with some pre-defined parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from splink.duckdb.duckdb_comparison_template_library import date_comparison\n",
+ "\n",
+ "date_of_birth_comparison = date_comparison(\"date_of_birth\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This gives a comparison structured as follows:\n",
+ "\n",
+ "```\n",
+ "Comparison: Date of birth\n",
+ "├─-- ComparisonLevel: Exact match\n",
+ "├─-- ComparisonLevel: Up to one character difference\n",
+ "├─-- ComparisonLevel: Up to two character difference\n",
+ "├─-- ComparisonLevel: Dates within 1 year of each other\n",
+ "├─-- ComparisonLevel: Dates within 10 years of each other\n",
+ "├─-- ComparisonLevel: All other\n",
+ "```\n",
+ "\n",
+ "Or, using `human_readable_description` to generate automatically from `date_of_birth_comparison`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Comparison 'Exact match vs. Dates within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of \"date_of_birth\".\n",
+ "Similarity is assessed using the following ComparisonLevels:\n",
+ " - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
+ " - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
+ " - 'Levenshtein <= 1' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n",
+ " - 'Levenshtein <= 2' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 2\n",
+ " - 'Within 1 year' with SQL rule: \n",
+ " abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+ " \n",
+ " - 'Within 10 years' with SQL rule: \n",
+ " abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 10\n",
+ " \n",
+ " - 'All other comparisons' with SQL rule: ELSE\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(date_of_birth_comparison.human_readable_description)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function also gives the user the flexibility to change the parameters and/or the fuzzy matching comparison levels.\n",
+ "\n",
+ "For example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Comparison 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of \"date_of_birth\".\n",
+ "Similarity is assessed using the following ComparisonLevels:\n",
+ " - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
+ " - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
+ " - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88\n",
+ " - 'Within 1 month' with SQL rule: \n",
+ " abs(date_diff('month', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+ " \n",
+ " - 'Within 1 year' with SQL rule: \n",
+ " abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+ " \n",
+ " - 'All other comparisons' with SQL rule: ELSE\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "date_of_birth_comparison = date_comparison(\n",
+ " \"date_of_birth\",\n",
+ " levenshtein_thresholds=[],\n",
+ " jaro_winkler_thresholds=[0.88],\n",
+ " datediff_thresholds=[1, 1],\n",
+ " datediff_metrics=[\"month\", \"year\"],\n",
+ ")\n",
+ "print(date_of_birth_comparison.human_readable_description)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see this as a specification dictionary, call `as_dict()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'output_column_name': 'date_of_birth',\n",
+ " 'comparison_levels': [{'sql_condition': '\"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL',\n",
+ " 'label_for_charts': 'Null',\n",
+ " 'is_null_level': True},\n",
+ " {'sql_condition': '\"date_of_birth_l\" = \"date_of_birth_r\"',\n",
+ " 'label_for_charts': 'Exact match'},\n",
+ " {'sql_condition': 'jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88',\n",
+ " 'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},\n",
+ " {'sql_condition': '\\n abs(date_diff(\\'month\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
+ " 'label_for_charts': 'Within 1 month'},\n",
+ " {'sql_condition': '\\n abs(date_diff(\\'year\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
+ " 'label_for_charts': 'Within 1 year'},\n",
+ " {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
+ " 'comparison_description': 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else'}"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "date_of_birth_comparison.as_dict()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This dictionary can be used as the basis for a more custom comparison, if desired, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary).\n",
+ "\n",
+ "<hr>"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Name Comparisons\n",
+ "\n",
+ "Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: \n",
+ "\n",
+ "- Null level \n",
+ "- Exact match \n",
+ "- Fuzzy match ([using metric of choice](comparators.md)) \n",
+ "- Else level\n",
+ "\n",
+ "The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function, which provides this structure out-of-the-box with some pre-defined parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from splink.duckdb.duckdb_comparison_template_library import name_comparison\n",
+ "\n",
+ "first_name_comparison = name_comparison(\"first_name\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This gives a comparison structured as follows:\n",
+ "\n",
+ "```\n",
+ "Comparison: First name\n",
+ "├─-- ComparisonLevel: Exact match\n",
+ "├─-- ComparisonLevel: First names with Jaro-Winkler similarity of at least 0.95\n",
+ "├─-- ComparisonLevel: First names with Jaro-Winkler similarity of at least 0.88\n",
+ "├─-- ComparisonLevel: All other\n",
+ "```\n",
+ "\n",
+ "Or, using `human_readable_description` to generate automatically from `first_name_comparison`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Comparison 'Exact match vs. Names within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of \"first_name\".\n",
+ "Similarity is assessed using the following ComparisonLevels:\n",
+ " - 'Null' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
+ " - 'Exact match first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
+ " - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.95\n",
+ " - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n",
+ " - 'All other comparisons' with SQL rule: ELSE\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(first_name_comparison.human_readable_description)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also gives the user the flexibility to change the parameters and/or the fuzzy matching comparison levels.\n",
+ "\n",
+ "For example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Comparison 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else' of \"surname\" and \"surname_dm\".\n",
+ "Similarity is assessed using the following ComparisonLevels:\n",
+ " - 'Null' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n",
+ " - 'Exact match surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
+ " - 'Exact match surname_dm' with SQL rule: \"surname_dm_l\" = \"surname_dm_r\"\n",
+ " - 'Levenshtein <= 2' with SQL rule: levenshtein(\"surname_l\", \"surname_r\") <= 2\n",
+ " - 'Jaccard >= 1' with SQL rule: jaccard(\"surname_l\", \"surname_r\") >= 1\n",
+ " - 'All other comparisons' with SQL rule: ELSE\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "surname_comparison = name_comparison(\n",
+ " \"surname\",\n",
+ " phonetic_col_name=\"surname_dm\",\n",
+ " term_frequency_adjustments_name=True,\n",
+ " levenshtein_thresholds=[2],\n",
+ " jaro_winkler_thresholds=[],\n",
+ " jaccard_thresholds=[1],\n",
+ ")\n",
+ "print(surname_comparison.human_readable_description)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here, `surname_dm` refers to a column produced by applying the Double Metaphone algorithm to `surname` to give a phonetic encoding. This helps to catch names which sound the same but are spelled differently (e.g. Stephens vs Stevens). For more on phonetic transformations, see the [topic guide](phonetic.md)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see this as a specification dictionary, call `as_dict()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'output_column_name': 'custom_surname_surname_dm',\n",
+ " 'comparison_levels': [{'sql_condition': '\"surname_l\" IS NULL OR \"surname_r\" IS NULL',\n",
+ " 'label_for_charts': 'Null',\n",
+ " 'is_null_level': True},\n",
+ " {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
+ " 'label_for_charts': 'Exact match surname',\n",
+ " 'tf_adjustment_column': 'surname',\n",
+ " 'tf_adjustment_weight': 1.0},\n",
+ " {'sql_condition': '\"surname_dm_l\" = \"surname_dm_r\"',\n",
+ " 'label_for_charts': 'Exact match surname_dm'},\n",
+ " {'sql_condition': 'levenshtein(\"surname_l\", \"surname_r\") <= 2',\n",
+ " 'label_for_charts': 'Levenshtein <= 2'},\n",
+ " {'sql_condition': 'jaccard(\"surname_l\", \"surname_r\") >= 1',\n",
+ " 'label_for_charts': 'Jaccard >= 1'},\n",
+ " {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
+ " 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else'}"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "surname_comparison.as_dict()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This dictionary can be used as the basis for a more custom comparison, if desired, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "base",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.12"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}