moj-analytical-services · sama-ds · Jun 12, 2023 · Jun 1, 2023 · Jun 1, 2023 · Jun 2, 2023
diff --git a/docs/comparison_level_library.md b/docs/comparison_level_library.md
@@ -21,7 +21,23 @@ However, not every comparison level is available for every [Splink-compatible SQ
 
 The pre-made Splink comparison levels available for each SQL dialect are as given in this table:
 
-{% include-markdown "./includes/generated_files/comparison_level_library_dialect_table.md" %}
+||duckdb|spark|athena|sqlite|
+|-|-|-|-|-|
+|`array_intersect_level`|✓|✓|✓||
+|`columns_reversed_level`|✓|✓|✓|✓|
+|`damerau_levenshtein_level`|✓|✓|||
+|`datediff_level`|✓|✓|||
+|`distance_function_level`|✓|✓|✓|✓|
+|`distance_in_km_level`|✓|✓|✓||
+|`else_level`|✓|✓|✓|✓|
+|`exact_match_level`|✓|✓|✓|✓|
+|`jaccard_level`|✓|✓|||
+|`jaro_level`|✓|✓|||
+|`jaro_winkler_level`|✓|✓|||
+|`levenshtein_level`|✓|✓|✓|✓|
+|`null_level`|✓|✓|✓|✓|
+|`percentage_difference_level`|✓|✓|✓|✓|
+
 
 
 The detailed API for each of these are outlined below.

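Note on the table above: the availability matrix can also be queried programmatically, for example to check which levels a given backend supports. This is a minimal sketch, not part of Splink. `parse_dialect_table` is a hypothetical helper, and `DIALECT_TABLE` holds only three rows copied from this diff.

```python
# Three rows copied from the table in this diff; the full table has more.
DIALECT_TABLE = """
||duckdb|spark|athena|sqlite|
|-|-|-|-|-|
|`array_intersect_level`|✓|✓|✓||
|`jaro_winkler_level`|✓|✓|||
|`levenshtein_level`|✓|✓|✓|✓|
"""


def parse_dialect_table(markdown):
    """Hypothetical helper: map each level name to the set of dialects marked with a tick."""
    lines = [ln.strip() for ln in markdown.strip().splitlines()]
    # Drop the outer pipes, then split; the first header cell is empty.
    dialects = lines[0][1:-1].split("|")[1:]
    support = {}
    for line in lines[2:]:  # skip the header and separator rows
        name, *marks = line[1:-1].split("|")
        support[name.strip("`")] = {
            dialect for dialect, mark in zip(dialects, marks) if mark == "✓"
        }
    return support


support = parse_dialect_table(DIALECT_TABLE)
sqlite_levels = sorted(name for name, ds in support.items() if "sqlite" in ds)
print(sqlite_levels)
```

Slicing off the outer pipes with `[1:-1]`, rather than stripping every `|`, keeps the trailing empty cells so each mark stays aligned with its dialect column.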
diff --git a/docs/comparison_library.md b/docs/comparison_library.md
@@ -17,7 +17,19 @@ However, not every comparison is available for every [Splink-compatible SQL back
 
 The pre-made Splink comparisons available for each SQL dialect are as given in this table:
 
-{% include-markdown "./includes/generated_files/comparison_library_dialect_table.md" %}
+||duckdb|spark|athena|sqlite|
+|-|-|-|-|-|
+|`array_intersect_at_sizes`|✓|✓|✓||
+|`damerau_levenshtein_at_thresholds`|✓|✓|||
+|`datediff_at_thresholds`|✓|✓|||
+|`distance_function_at_thresholds`|✓|✓|✓|✓|
+|`distance_in_km_at_thresholds`|✓|✓|✓||
+|`exact_match`|✓|✓|✓|✓|
+|`jaccard_at_thresholds`|✓|✓|||
+|`jaro_at_thresholds`|✓|✓|||
+|`jaro_winkler_at_thresholds`|✓|✓|||
+|`levenshtein_at_thresholds`|✓|✓|✓|✓|
+
 
 
 

diff --git a/docs/comparison_template_library.md b/docs/comparison_template_library.md
@@ -13,7 +13,14 @@ However, not every comparison is available for every [Splink-compatible SQL back
 
 The pre-made Splink comparison templates available for each SQL dialect are as given in this table:
 
-{% include-markdown "./includes/generated_files/comparison_template_library_dialect_table.md" %}
+||duckdb|spark|athena|sqlite|
+|-|-|-|-|-|
+|`date_comparison`|✓|✓|||
+|`email_comparison`|✓|✓|||
+|`forename_surname_comparison`|✓|✓|||
+|`name_comparison`|✓|✓|||
+|`postcode_comparison`|✓|✓|✓||
+
 
 
 The detailed API for each of these are outlined below.
@@ -66,4 +73,16 @@ The detailed API for each of these are outlined below.
       show_source: false
       heading_level: 2
 
+---
+
+::: splink.comparison_template_library.EmailComparisonBase
+    handler: python
+    selection:
+      members:
+        -  __init__
+    rendering:
+      show_root_heading: true
+      show_source: false
+      heading_level: 2
+
 ---
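Note on the email template documented above: the SQL it generates (printed in the notebook changes later in this diff) relies on two regular expressions, one for the username and one for a valid address. The sketch below shows roughly what those patterns do, using Python's `re` module rather than SQL. It is an illustration under assumptions, not Splink's implementation: `regexp_extract`, `is_null_level` and `same_username` are hypothetical stand-ins, and returning `''` for a non-match is assumed from the `== ''` checks in the generated SQL.

```python
import re

# Patterns copied from the generated SQL shown in the notebook output.
USERNAME_REGEX = r"^[^@]+"
VALID_EMAIL_REGEX = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"


def regexp_extract(value, pattern):
    """Hypothetical stand-in for SQL regexp_extract: the match, '' if none, None for NULL."""
    if value is None:
        return None
    match = re.search(pattern, value)
    return match.group(0) if match else ""


def is_null_level(email_l, email_r, invalid_emails_as_null=False):
    """Mimic the 'Null' level, optionally treating invalid emails as null."""
    if not invalid_emails_as_null:
        return email_l is None or email_r is None
    extracted = [regexp_extract(e, VALID_EMAIL_REGEX) for e in (email_l, email_r)]
    return any(e is None or e == "" for e in extracted)


def same_username(email_l, email_r):
    """Mimic the 'exact match on username with different domain' rule."""
    return regexp_extract(email_l, USERNAME_REGEX) == regexp_extract(
        email_r, USERNAME_REGEX
    )


print(same_username("john.smith@gmail.com", "john.smith@yahoo.co.uk"))
print(is_null_level("not-an-email", "a@b.com", invalid_emails_as_null=True))
```

With `invalid_emails_as_null=True`, a malformed value such as `not-an-email` falls into the null level instead of reaching the fuzzy-match levels.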
diff --git a/docs/topic_guides/comparison_templates.ipynb b/docs/topic_guides/comparison_templates.ipynb
@@ -901,6 +901,176 @@
    "source": [
     "Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
    ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Email Comparisons\n",
+    "\n",
+    "The [comparison_template_library](../comparison_template_library.md##splink.comparison_template_library) contains the [email_comparison](../comparison_template_library.md##splink.comparison_template_library.EmailComparisonBase) function, which provides a sensible out-of-the-box approach to comparing emails. See [Feature Engineering](./feature_engineering.md) for more details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from splink.duckdb.duckdb_comparison_template_library import email_comparison\n",
+    "\n",
+    "email_comparison = email_comparison(\"email\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Gives a comparison structured as follows:\n",
+    "\n",
+    "```\n",
+    "Comparison: Email\n",
+    "├─-- ComparisonLevel: Exact match\n",
+    "├─-- ComparisonLevel: Exact match on username with different domain\n",
+    "├─-- ComparisonLevel: Fuzzy match on email using Jaro-Winkler\n",
+    "├─-- ComparisonLevel: Fuzzy match on username using Jaro-Winkler\n",
+    "├─-- ComparisonLevel: All other comparisons\n",
+    "```\n",
+    "\n",
+    "Or, using `human_readable_description` to generate automatically from `email_comparison`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of \"email\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \"email_l\" IS NULL OR \"email_r\" IS NULL\n",
+      "    - 'Exact match email' with SQL rule: \"email_l\" = \"email_r\"\n",
+      "    - 'Exact match email' with SQL rule: \n",
+      "        regexp_extract(\"email_l\", '^[^@]+')\n",
+      "     = \n",
+      "        regexp_extract(\"email_r\", '^[^@]+')\n",
+      "    \n",
+      "    - 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n",
+      "    - 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(\n",
+      "        regexp_extract(\"email_l\", '^[^@]+')\n",
+      "    , \n",
+      "        regexp_extract(\"email_r\", '^[^@]+')\n",
+      "    ) >= 0.88\n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(email_comparison.human_readable_description)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, the username (everything before the `@`) is extracted under the hood using the `regex_extract` argument."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, fuzzy matching uses Jaro-Winkler thresholds. Jaro-Winkler gives extra weight to agreement at the start of a string (up to the first four characters), which may not be appropriate for all emails.\n",
+    "\n",
+    "Users can also set `invalid_emails_as_null` to `True`. If `True`, emails that do not match a valid email format, as determined by `valid_email_regex`, are included in the null level. `valid_email_regex` defaults to `\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\"`.\n",
+    "\n",
+    "For example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of \"email\".\n",
+      "Similarity is assessed using the following ComparisonLevels:\n",
+      "    - 'Null' with SQL rule: \n",
+      "        regexp_extract(\"email_l\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
+      "     IS NULL OR \n",
+      "        regexp_extract(\"email_r\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
+      "     IS NULL OR\n",
+      "                      \n",
+      "        regexp_extract(\"email_l\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
+      "    =='' OR \n",
+      "        regexp_extract(\"email_r\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
+      "     ==''\n",
+      "    - 'Exact match email' with SQL rule: \"email_l\" = \"email_r\"\n",
+      "    - 'Exact match email' with SQL rule: \n",
+      "        regexp_extract(\"email_l\", '^[^@]+')\n",
+      "     = \n",
+      "        regexp_extract(\"email_r\", '^[^@]+')\n",
+      "    \n",
+      "    - 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n",
+      "    - 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(\n",
+      "        regexp_extract(\"email_l\", '^[^@]+')\n",
+      "    , \n",
+      "        regexp_extract(\"email_r\", '^[^@]+')\n",
+      "    ) >= 0.88\n",
+      "    - 'All other comparisons' with SQL rule: ELSE\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
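+    "# Re-import: an earlier cell rebound `email_comparison` to a Comparison object.\n",
+    "from splink.duckdb.duckdb_comparison_template_library import email_comparison\n",
+    "\n",
+    "email_comparison = email_comparison(\"email\", invalid_emails_as_null=True)\n",
+    "print(email_comparison.human_readable_description)"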
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see this as a specification dictionary, call:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "email_comparison.as_dict()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {