Merge f87b4a2 into fa7aea7
RossKen committed Apr 12, 2023
2 parents fa7aea7 + f87b4a2 commit ffe2173
Showing 5 changed files with 719 additions and 354 deletions.
385 changes: 385 additions & 0 deletions docs/topic_guides/comparison_templates.ipynb
@@ -0,0 +1,385 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Out-of-the-box Comparisons for specific data types\n",
"\n",
"Similarity is defined differently for types of data (e.g. names, dates of birth, postcodes, addresses, ids). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate read-made comparisons for a variety of data types.\n",
"\n",
"Below are examples of how to structure comparisons for a variety of data types.\n",
"\n",
"<hr>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Date Comparisons\n",
"\n",
"Date comparisons are generally structured as: \n",
"\n",
"- Null level \n",
"- Exact match \n",
"- Fuzzy match ([using metric of choice](comparators.md)) \n",
"- Interval match (within X days/months/years) \n",
"- Else level\n",
"\n",
"The [comparison_template_library](#method-2-using-the-comparisontemplatelibrary) contains the [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_comparison_template_library import date_comparison\n",
"\n",
"date_of_birth_comparison = date_comparison(\"date_of_birth\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Gives a comparison structured as follows:\n",
"\n",
"```\n",
"Comparison: Date of birth\n",
"├─-- ComparisonLevel: Exact match\n",
"├─-- ComparisonLevel: Up to one character difference\n",
"├─-- ComparisonLevel: Up to two character difference\n",
"├─-- ComparisonLevel: Dates within 1 year of each other\n",
"├─-- ComparisonLevel: Dates within 10 years of each other\n",
"├─-- ComparisonLevel: All other\n",
"```\n",
"\n",
"Or, using `human_readable_description` to generate automatically from `date_of_birth_comparison`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Dates within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of \"date_of_birth\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
" - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
" - 'Levenshtein <= 1' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n",
" - 'Levenshtein <= 2' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 2\n",
" - 'Within 1 year' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'Within 10 years' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 10\n",
" \n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"print(date_of_birth_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of \"date_of_birth\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
" - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
" - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88\n",
" - 'Within 1 month' with SQL rule: \n",
" abs(date_diff('month', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'Within 1 year' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"date_of_birth_comparison = date_comparison(\n",
" \"date_of_birth\",\n",
" levenshtein_thresholds=[],\n",
" jaro_winkler_thresholds=[0.88],\n",
" datediff_thresholds=[1, 1],\n",
" datediff_metrics=[\"month\", \"year\"],\n",
")\n",
"print(date_of_birth_comparison.human_readable_description)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this as a specifications dictionary you can call"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'output_column_name': 'date_of_birth',\n",
" 'comparison_levels': [{'sql_condition': '\"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL',\n",
" 'label_for_charts': 'Null',\n",
" 'is_null_level': True},\n",
" {'sql_condition': '\"date_of_birth_l\" = \"date_of_birth_r\"',\n",
" 'label_for_charts': 'Exact match'},\n",
" {'sql_condition': 'jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88',\n",
" 'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},\n",
" {'sql_condition': '\\n abs(date_diff(\\'month\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
" 'label_for_charts': 'Within 1 month'},\n",
" {'sql_condition': '\\n abs(date_diff(\\'year\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
" 'label_for_charts': 'Within 1 year'},\n",
" {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
" 'comparison_description': 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_of_birth_comparison.as_dict()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.\n",
"\n",
"<hr>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Name Comparisons\n",
"\n",
"Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: \n",
"\n",
"- Null level \n",
"- Exact match \n",
"- Fuzzy match ([using metric of choice](comparators.md)) \n",
"- Else level\n",
"\n",
"The [comparison_template_library](#method-2-using-the-comparisontemplatelibrary) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_comparison_template_library import name_comparison\n",
"\n",
"first_name_comparison = name_comparison(\"first_name\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Gives a comparison structured as follows:\n",
"\n",
"```\n",
"Comparison: Date of birth\n",
"├─-- ComparisonLevel: Exact match\n",
"├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.95\n",
"├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.88\n",
"├─-- ComparisonLevel: All other\n",
"```\n",
"\n",
"Or, using `human_readable_description` to generate automatically from `first_name_comparison`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Names within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of \"first_name\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
" - 'Exact match first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
" - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.95\n",
" - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"print(first_name_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also allowing flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else' of \"surname\" and \"surname_dm\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n",
" - 'Exact match surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
" - 'Exact match surname_dm' with SQL rule: \"surname_dm_l\" = \"surname_dm_r\"\n",
" - 'Levenshtein <= 2' with SQL rule: levenshtein(\"surname_l\", \"surname_r\") <= 2\n",
" - 'Jaccard >= 1' with SQL rule: jaccard(\"surname_l\", \"surname_r\") >= 1\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"surname_comparison = name_comparison(\n",
" \"surname\",\n",
" phonetic_col_name=\"surname_dm\",\n",
" term_frequency_adjustments_name=True,\n",
" levenshtein_thresholds=[2],\n",
" jaro_winkler_thresholds=[],\n",
" jaccard_thresholds=[1],\n",
")\n",
"print(surname_comparison.human_readable_description)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where `surname_dm` refers to a column which has used the DoubleMetaphone algorithm on `surname` to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this as a specifications dictionary you can call"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'output_column_name': 'custom_surname_surname_dm',\n",
" 'comparison_levels': [{'sql_condition': '\"surname_l\" IS NULL OR \"surname_r\" IS NULL',\n",
" 'label_for_charts': 'Null',\n",
" 'is_null_level': True},\n",
" {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
" 'label_for_charts': 'Exact match surname',\n",
" 'tf_adjustment_column': 'surname',\n",
" 'tf_adjustment_weight': 1.0},\n",
" {'sql_condition': '\"surname_dm_l\" = \"surname_dm_r\"',\n",
" 'label_for_charts': 'Exact match surname_dm'},\n",
" {'sql_condition': 'levenshtein(\"surname_l\", \"surname_r\") <= 2',\n",
" 'label_for_charts': 'Levenshtein <= 2'},\n",
" {'sql_condition': 'jaccard(\"surname_l\", \"surname_r\") >= 1',\n",
" 'label_for_charts': 'Jaccard >= 1'},\n",
" {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
" 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"surname_comparison.as_dict()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}