Initial commit for email comparison level feature. #1277

Merged
merged 27 commits into from
Jun 12, 2023
Changes from 19 commits
9ceb28c
Initial commit for email comparison level feature. Added email compar…
sama-ds Jun 1, 2023
9c71966
Spotted some errors within name comparisons. Some redundant parameter…
sama-ds Jun 1, 2023
a8695e7
Merge branch 'master' into email_comparrison_template
RossKen Jun 2, 2023
a9c0ed5
lint with black
RossKen Jun 2, 2023
293c1e2
fix paths to comparisons
RossKen Jun 2, 2023
6eafc1a
Merge branch 'email_comparrison_template' of github.com:moj-analytica…
RossKen Jun 2, 2023
ceb8fe0
lint with black
RossKen Jun 2, 2023
66755e0
improve labelling for fuzzy matches
RossKen Jun 5, 2023
6aa69e2
Merge branch 'email_comparrison_template' of github.com:moj-analytica…
RossKen Jun 5, 2023
e89f013
Addded functionality for athena linker to run. Committing prior to fi…
sama-ds Jun 9, 2023
f519c5e
Adding asserting across the gamma lookup in conftest.py to be used ac…
sama-ds Jun 9, 2023
403f1fd
As athena does not have JW functionality, these changes are redundant…
sama-ds Jun 9, 2023
c28c06c
Merge branch 'master' into email_comparrison_template
sama-ds Jun 12, 2023
0a724fe
lint with black
sama-ds Jun 12, 2023
a6de7c9
Updated the comparison_template_library files to align with the new d…
sama-ds Jun 12, 2023
b2e0aad
lint with black
sama-ds Jun 12, 2023
79c13f2
Included new test_gamma_assert function in postcode and dob ctl's.
sama-ds Jun 12, 2023
8807d7e
lint with black
sama-ds Jun 12, 2023
3ee6225
Added docs to topic guide for email comparison. Spotted a few mistake…
sama-ds Jun 12, 2023
580b876
minor docs changes
RossKen Jun 12, 2023
b2ce885
remove athena example
RossKen Jun 12, 2023
562267e
add jaro to email and small changes
RossKen Jun 12, 2023
be958f2
Merge branch 'master' into email_comparrison_template
RossKen Jun 12, 2023
69f4aee
add email_comparison to README
RossKen Jun 12, 2023
a93b600
Merge branch 'email_comparrison_template' of github.com:moj-analytica…
RossKen Jun 12, 2023
e916830
polish topic guide
RossKen Jun 12, 2023
a000611
fix imports
RossKen Jun 12, 2023
18 changes: 17 additions & 1 deletion docs/comparison_level_library.md
@@ -21,7 +21,23 @@ However, not every comparison level is available for every [Splink-compatible SQ

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_level_library_dialect_table.md" %}
||duckdb|spark|athena|sqlite|
|-|-|-|-|-|
|`array_intersect_level`|✓|✓|✓||
|`columns_reversed_level`|✓|✓|✓|✓|
|`damerau_levenshtein_level`|✓|✓|||
|`datediff_level`|✓|✓|||
|`distance_function_level`|✓|✓|✓|✓|
|`distance_in_km_level`|✓|✓|✓||
|`else_level`|✓|✓|✓|✓|
|`exact_match_level`|✓|✓|✓|✓|
|`jaccard_level`|✓|✓|||
|`jaro_level`|✓|✓|||
|`jaro_winkler_level`|✓|✓|||
|`levenshtein_level`|✓|✓|✓|✓|
|`null_level`|✓|✓|✓|✓|
|`percentage_difference_level`|✓|✓|✓|✓|



The detailed API for each of these are outlined below.
14 changes: 13 additions & 1 deletion docs/comparison_library.md
@@ -17,7 +17,19 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparisons available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_library_dialect_table.md" %}
||duckdb|spark|athena|sqlite|
|-|-|-|-|-|
|`array_intersect_at_sizes`|✓|✓|✓||
|`damerau_levenshtein_at_thresholds`|✓|✓|||
|`datediff_at_thresholds`|✓|✓|||
|`distance_function_at_thresholds`|✓|✓|✓|✓|
|`distance_in_km_at_thresholds`|✓|✓|✓||
|`exact_match`|✓|✓|✓|✓|
|`jaccard_at_thresholds`|✓|✓|||
|`jaro_at_thresholds`|✓|✓|||
|`jaro_winkler_at_thresholds`|✓|✓|||
|`levenshtein_at_thresholds`|✓|✓|✓|✓|




21 changes: 20 additions & 1 deletion docs/comparison_template_library.md
@@ -13,7 +13,14 @@ However, not every comparison is available for every [Splink-compatible SQL back

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:

{% include-markdown "./includes/generated_files/comparison_template_library_dialect_table.md" %}
||duckdb|spark|athena|sqlite|
|-|-|-|-|-|
|`date_comparison`|✓|✓|||
|`email_comparison`|✓|✓|||
|`forename_surname_comparison`|✓|✓|||
|`name_comparison`|✓|✓|||
|`postcode_comparison`|✓|✓|✓||



The detailed API for each of these are outlined below.
@@ -66,4 +73,16 @@ The detailed API for each of these are outlined below.
show_source: false
heading_level: 2

---

::: splink.comparison_template_library.EmailComparisonBase
handler: python
selection:
members:
- __init__
rendering:
show_root_heading: true
show_source: false
heading_level: 2

---
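
The `email_comparison` template documented above relies on two regular expressions that appear in the notebook changes in this PR: `^[^@]+` to pull out the username, and the default `valid_email_regex` used when `invalid_emails_as_null=True`. As a rough illustration of what those patterns accept and reject, here is a minimal sketch in plain Python. It is not Splink's implementation: the helper names (`extract_username`, `is_valid_email`, `falls_in_null_level`) are hypothetical, and Python's `re` module stands in for the SQL backend's `regexp_extract`.

```python
import re

# Patterns copied from the notebook output in this PR.
USERNAME_REGEX = r"^[^@]+"
VALID_EMAIL_REGEX = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"


def extract_username(email):
    """Return everything before the first '@', or '' when nothing matches."""
    match = re.search(USERNAME_REGEX, email)
    return match.group(0) if match else ""


def is_valid_email(email):
    """True when the string is not None and matches the default validity pattern."""
    return email is not None and re.search(VALID_EMAIL_REGEX, email) is not None


def falls_in_null_level(email_l, email_r):
    """Hypothetical helper: approximates the 'Null' rule with invalid_emails_as_null=True.

    The pair is treated as null if either side is missing or fails validation.
    """
    return not (is_valid_email(email_l) and is_valid_email(email_r))


if __name__ == "__main__":
    print(extract_username("jane.doe@example.com"))
    print(is_valid_email("jane.doe@example"))
    print(falls_in_null_level("not-an-email", "jane.doe@example.com"))
```

Under these assumptions, two addresses that share a username but differ in domain produce the same `extract_username` result, which is the condition behind the "exact match on username with different domain" level.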
170 changes: 170 additions & 0 deletions docs/topic_guides/comparison_templates.ipynb
@@ -901,6 +901,176 @@
"source": [
"Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Email Comparisons\n",
"\n",
"The [comparison_template_library](../comparison_template_library.md##splink.comparison_template_library) contains the [email_comparison](../comparison_template_library.md##splink.comparison_template_library.EmailComparisonBase) function which provides a sensible approach to comparing emails out-of-the-box. See [Feature Engineering](./feature_engineering.md) for more details."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_comparison_template_library import email_comparison\n",
"\n",
"email_comparison = email_comparison(\"email\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Gives a comparison structured as follows:\n",
"\n",
"```\n",
"Comparison: Email\n",
"├─-- ComparisonLevel: Exact match\n",
"├─-- ComparisonLevel: Exact match on username with different domain\n",
"├─-- ComparisonLevel: Fuzzy match on email using Jaro-Winkler\n",
"├─-- ComparisonLevel: Fuzzy match on username using Jaro-Winkler\n",
"├─-- ComparisonLevel: All other comparisons\n",
"```\n",
"\n",
"Or, using `human_readable_description` to generate automatically from `email_comparison`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of \"email\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"email_l\" IS NULL OR \"email_r\" IS NULL\n",
" - 'Exact match email' with SQL rule: \"email_l\" = \"email_r\"\n",
" - 'Exact match email' with SQL rule: \n",
" regexp_extract(\"email_l\", '^[^@]+')\n",
" = \n",
" regexp_extract(\"email_r\", '^[^@]+')\n",
" \n",
" - 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n",
" - 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(\n",
" regexp_extract(\"email_l\", '^[^@]+')\n",
" , \n",
" regexp_extract(\"email_r\", '^[^@]+')\n",
" ) >= 0.88\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"print(email_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"where the individual email components (for example, the username before the `@`) are extracted under the hood using the `regex_extract` argument."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, fuzzy matching uses Jaro-Winkler thresholds. Jaro-Winkler gives extra weight to strings that share a common prefix (up to the first four characters), which may not be appropriate for all emails.\n",
"\n",
"Users also have the option to set `invalid_emails_as_null` to `True`. If `True`, emails that do not adhere to a valid email format, as determined by `valid_email_regex`, will be included in the null level. `valid_email_regex` defaults to `\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\"`.\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of \"email\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \n",
" regexp_extract(\"email_l\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
" IS NULL OR \n",
" regexp_extract(\"email_r\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
" IS NULL OR\n",
" \n",
" regexp_extract(\"email_l\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
" =='' OR \n",
" regexp_extract(\"email_r\", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')\n",
" ==''\n",
" - 'Exact match email' with SQL rule: \"email_l\" = \"email_r\"\n",
" - 'Exact match email' with SQL rule: \n",
" regexp_extract(\"email_l\", '^[^@]+')\n",
" = \n",
" regexp_extract(\"email_r\", '^[^@]+')\n",
" \n",
" - 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n",
" - 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(\n",
" regexp_extract(\"email_l\", '^[^@]+')\n",
" , \n",
" regexp_extract(\"email_r\", '^[^@]+')\n",
" ) >= 0.88\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
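"# Re-import so the name refers to the template function again,\n",
"# not the Comparison object assigned in the earlier cell.\n",
"from splink.duckdb.duckdb_comparison_template_library import email_comparison\n",
"\n",
"email_comparison = email_comparison(\n",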
" \"email\",\n",
" invalid_emails_as_null=True\n",
")\n",
"print(email_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this as a specification dictionary, you can call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"email_comparison.as_dict()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
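
The notebook output in this diff lists the levels of the email comparison in order: null, exact match, exact username match, Jaro-Winkler on the full email, Jaro-Winkler on the username, then everything else. The following is a minimal pure-Python sketch of that cascade, assuming the default threshold of 0.88 shown in the output. The function names are hypothetical, `split("@")` stands in for the `^[^@]+` regex, and the textbook Jaro-Winkler below (prefix capped at four characters, scaling factor 0.1) may differ in small details from the backend's `jaro_winkler_similarity`.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted by the shared prefix (at most four characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def email_level(email_l, email_r, threshold=0.88):
    """Hypothetical helper: return the first level whose rule the pair satisfies."""
    if email_l is None or email_r is None:
        return "Null"
    if email_l == email_r:
        return "Exact match"
    user_l, user_r = email_l.split("@")[0], email_r.split("@")[0]
    if user_l == user_r:
        return "Exact match on username with different domain"
    if jaro_winkler(email_l, email_r) >= threshold:
        return "Fuzzy match on email"
    if jaro_winkler(user_l, user_r) >= threshold:
        return "Fuzzy match on username"
    return "All other comparisons"


if __name__ == "__main__":
    print(email_level("jane.doe@example.com", "jane.deo@example.com"))
    print(email_level("john@abc.com", "jonh@xyz.org"))
```

Because the levels are evaluated in order, a pair only reaches the username-level fuzzy check when the full-email score falls below the threshold, as happens when the usernames are close but the domains differ substantially.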