moj-analytical-services · RossKen · Apr 13, 2023 · Mar 27, 2023 · Apr 2, 2023 · Apr 3, 2023
diff --git a/docs/topic_guides/comparison_templates.ipynb b/docs/topic_guides/comparison_templates.ipynb
@@ -0,0 +1,385 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Out-of-the-box Comparisons for specific data types\n",
+    "\n",
+    "Similarity is defined differently for different types of data (e.g. names, dates of birth, postcodes, addresses, IDs). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate ready-made comparisons for a variety of data types.\n",
+    "\n",
+    "Below are examples of how to structure comparisons for a variety of data types.\n",
+    "\n",
+    "<hr>"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Date Comparisons\n",
+    "\n",
+    "Date comparisons are generally structured as: \n",
+    "\n",
+    "- Null level  \n",
+    "- Exact match  \n",
+    "- Fuzzy match ([using metric of choice](comparators.md))  \n",
+    "- Interval match (within X days/months/years)  \n",
+    "- Else level\n",
+    "\n",
+    "The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function, which gives this structure, with some pre-defined parameters, out-of-the-box."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from splink.duckdb.duckdb_comparison_template_library import date_comparison\n",
+    "\n",
+    "date_of_birth_comparison = date_comparison(\"date_of_birth\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This gives a comparison structured as follows:\n",
+    "\n",
+    "```\n",
+    "Comparison: Date of birth\n",
+    "├─-- ComparisonLevel: Exact match\n",
+    "├─-- ComparisonLevel: Up to one character difference\n",
+    "├─-- ComparisonLevel: Up to two character difference\n",
+    "├─-- ComparisonLevel: Dates within 1 year of each other\n",
+    "├─-- ComparisonLevel: Dates within 10 years of each other\n",
+    "├─-- ComparisonLevel: All other\n",
+    "```\n",
+    "\n",
+    "Or, using `human_readable_description` to generate the description automatically from `date_of_birth_comparison`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Dates within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of \"date_of_birth\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
+      "    - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
+      "    - 'Levenshtein <= 1' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n",
+      "    - 'Levenshtein <= 2' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 2\n",
+      "    - 'Within 1 year' with SQL rule: \n",
+      "        abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+      "    \n",
+      "    - 'Within 10 years' with SQL rule: \n",
+      "        abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 10\n",
+      "    \n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(date_of_birth_comparison.human_readable_description)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The [date_comparison](../comparison_template_library.md#splink.comparison_template_library.DateComparisonBase) function also gives the user the flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
+    "\n",
+    "For example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of \"date_of_birth\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
+      "    - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
+      "    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88\n",
+      "    - 'Within 1 month' with SQL rule: \n",
+      "        abs(date_diff('month', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+      "    \n",
+      "    - 'Within 1 year' with SQL rule: \n",
+      "        abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
+      "    \n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "date_of_birth_comparison = date_comparison(\n",
+    "    \"date_of_birth\",\n",
+    "    levenshtein_thresholds=[],\n",
+    "    jaro_winkler_thresholds=[0.88],\n",
+    "    datediff_thresholds=[1, 1],\n",
+    "    datediff_metrics=[\"month\", \"year\"],\n",
+    ")\n",
+    "print(date_of_birth_comparison.human_readable_description)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see this as a specification dictionary, you can call:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'output_column_name': 'date_of_birth',\n",
+       " 'comparison_levels': [{'sql_condition': '\"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL',\n",
+       "   'label_for_charts': 'Null',\n",
+       "   'is_null_level': True},\n",
+       "  {'sql_condition': '\"date_of_birth_l\" = \"date_of_birth_r\"',\n",
+       "   'label_for_charts': 'Exact match'},\n",
+       "  {'sql_condition': 'jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88',\n",
+       "   'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},\n",
+       "  {'sql_condition': '\\n        abs(date_diff(\\'month\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n    ',\n",
+       "   'label_for_charts': 'Within 1 month'},\n",
+       "  {'sql_condition': '\\n        abs(date_diff(\\'year\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n    ',\n",
+       "   'label_for_charts': 'Within 1 year'},\n",
+       "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
+       " 'comparison_description': 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else'}"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "date_of_birth_comparison.as_dict()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This can be used as the basis for a more custom comparison, if desired, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary).\n",
+    "\n",
+    "<hr>"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Name Comparisons\n",
+    "\n",
+    "Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: \n",
+    "\n",
+    "- Null level  \n",
+    "- Exact match  \n",
+    "- Fuzzy match ([using metric of choice](comparators.md))  \n",
+    "- Else level\n",
+    "\n",
+    "The [comparison_template_library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function, which gives this structure, with some pre-defined parameters, out-of-the-box."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from splink.duckdb.duckdb_comparison_template_library import name_comparison\n",
+    "\n",
+    "first_name_comparison = name_comparison(\"first_name\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This gives a comparison structured as follows:\n",
+    "\n",
+    "```\n",
+    "Comparison: First name\n",
+    "├─-- ComparisonLevel: Exact match\n",
+    "├─-- ComparisonLevel: First names with Jaro-Winkler similarity of at least 0.95\n",
+    "├─-- ComparisonLevel: First names with Jaro-Winkler similarity of at least 0.88\n",
+    "├─-- ComparisonLevel: All other\n",
+    "```\n",
+    "\n",
+    "Or, using `human_readable_description` to generate the description automatically from `first_name_comparison`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Names within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of \"first_name\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
+      "    - 'Exact match first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
+      "    - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.95\n",
+      "    - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(first_name_comparison.human_readable_description)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also gives the user the flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
+    "\n",
+    "For example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else' of \"surname\" and \"surname_dm\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n",
+      "    - 'Exact match surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
+      "    - 'Exact match surname_dm' with SQL rule: \"surname_dm_l\" = \"surname_dm_r\"\n",
+      "    - 'Levenshtein <= 2' with SQL rule: levenshtein(\"surname_l\", \"surname_r\") <= 2\n",
+      "    - 'Jaccard >= 1' with SQL rule: jaccard(\"surname_l\", \"surname_r\") >= 1\n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "surname_comparison = name_comparison(\n",
+    "    \"surname\",\n",
+    "    phonetic_col_name=\"surname_dm\",\n",
+    "    term_frequency_adjustments_name=True,\n",
+    "    levenshtein_thresholds=[2],\n",
+    "    jaro_winkler_thresholds=[],\n",
+    "    jaccard_thresholds=[1],\n",
+    ")\n",
+    "print(surname_comparison.human_readable_description)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, `surname_dm` refers to a column in which the Double Metaphone algorithm has been applied to `surname` to give a phonetic spelling. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on phonetic transformations, see the [topic guide](phonetic.md)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see this as a specification dictionary, you can call:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'output_column_name': 'custom_surname_surname_dm',\n",
+       " 'comparison_levels': [{'sql_condition': '\"surname_l\" IS NULL OR \"surname_r\" IS NULL',\n",
+       "   'label_for_charts': 'Null',\n",
+       "   'is_null_level': True},\n",
+       "  {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
+       "   'label_for_charts': 'Exact match surname',\n",
+       "   'tf_adjustment_column': 'surname',\n",
+       "   'tf_adjustment_weight': 1.0},\n",
+       "  {'sql_condition': '\"surname_dm_l\" = \"surname_dm_r\"',\n",
+       "   'label_for_charts': 'Exact match surname_dm'},\n",
+       "  {'sql_condition': 'levenshtein(\"surname_l\", \"surname_r\") <= 2',\n",
+       "   'label_for_charts': 'Levenshtein <= 2'},\n",
+       "  {'sql_condition': 'jaccard(\"surname_l\", \"surname_r\") >= 1',\n",
+       "   'label_for_charts': 'Jaccard >= 1'},\n",
+       "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
+       " 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else'}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "surname_comparison.as_dict()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This can be used as the basis for a more custom comparison, if desired, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}