Add docs for Feature Engineering #1178

Merged
merged 19 commits into from
Apr 13, 2023
385 changes: 385 additions & 0 deletions docs/topic_guides/comparison_templates.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,385 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Out-of-the-box Comparisons for specific data types\n",
"\n",
"Similarity is defined differently for types of data (e.g. names, dates of birth, postcodes, addresses, ids). The [Comparison Template Library](customising_comparisons.ipynb#method-2-using-the-comparisontemplatelibrary) contains functions to generate read-made comparisons for a variety of data types.\n",
"\n",
"Below are examples of how to structure comparisons for a variety of data types.\n",
"\n",
"<hr>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Date Comparisons\n",
"\n",
"Date comparisons are generally structured as: \n",
"\n",
"- Null level \n",
"- Exact match \n",
"- Fuzzy match ([using metric of choice](comparators.md)) \n",
"- Interval match (within X days/months/years) \n",
"- Else level\n",
"\n",
"The [comparison_template_library](#method-2-using-the-comparisontemplatelibrary) contains the [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_comparison_template_library import date_comparison\n",
"\n",
"date_of_birth_comparison = date_comparison(\"date_of_birth\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Gives a comparison structured as follows:\n",
"\n",
"```\n",
"Comparison: Date of birth\n",
"├─-- ComparisonLevel: Exact match\n",
"├─-- ComparisonLevel: Up to one character difference\n",
"├─-- ComparisonLevel: Up to two character difference\n",
"├─-- ComparisonLevel: Dates within 1 year of each other\n",
"├─-- ComparisonLevel: Dates within 10 years of each other\n",
"├─-- ComparisonLevel: All other\n",
"```\n",
"\n",
"Or, using `human_readable_description` to generate automatically from `date_of_birth_comparison`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Dates within levenshtein thresholds 1, 2 vs. Dates within the following thresholds Year(s): 1, Year(s): 10 vs. anything else' of \"date_of_birth\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
" - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
" - 'Levenshtein <= 1' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n",
" - 'Levenshtein <= 2' with SQL rule: levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 2\n",
" - 'Within 1 year' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'Within 10 years' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 10\n",
" \n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"print(date_of_birth_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The [date_comparison](../comparison_template_library.md##splink.comparison_template_library.DateComparisonBase) function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else' of \"date_of_birth\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL\n",
" - 'Exact match' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
" - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88\n",
" - 'Within 1 month' with SQL rule: \n",
" abs(date_diff('month', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'Within 1 year' with SQL rule: \n",
" abs(date_diff('year', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\n",
" \n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"date_of_birth_comparison = date_comparison(\n",
" \"date_of_birth\",\n",
" levenshtein_thresholds=[],\n",
" jaro_winkler_thresholds=[0.88],\n",
" datediff_thresholds=[1, 1],\n",
" datediff_metrics=[\"month\", \"year\"],\n",
")\n",
"print(date_of_birth_comparison.human_readable_description)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this as a specifications dictionary you can call"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'output_column_name': 'date_of_birth',\n",
" 'comparison_levels': [{'sql_condition': '\"date_of_birth_l\" IS NULL OR \"date_of_birth_r\" IS NULL',\n",
" 'label_for_charts': 'Null',\n",
" 'is_null_level': True},\n",
" {'sql_condition': '\"date_of_birth_l\" = \"date_of_birth_r\"',\n",
" 'label_for_charts': 'Exact match'},\n",
" {'sql_condition': 'jaro_winkler_similarity(\"date_of_birth_l\", \"date_of_birth_r\") >= 0.88',\n",
" 'label_for_charts': 'Jaro_winkler_similarity >= 0.88'},\n",
" {'sql_condition': '\\n abs(date_diff(\\'month\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
" 'label_for_charts': 'Within 1 month'},\n",
" {'sql_condition': '\\n abs(date_diff(\\'year\\', \"date_of_birth_l\", \"date_of_birth_r\")) <= 1\\n ',\n",
" 'label_for_charts': 'Within 1 year'},\n",
" {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
" 'comparison_description': 'Exact match vs. Dates within jaro_winkler threshold 0.88 vs. Dates within the following thresholds Month(s): 1, Year(s): 1 vs. anything else'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_of_birth_comparison.as_dict()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired.\n",
"\n",
"<hr>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Name Comparisons\n",
"\n",
"Name comparisons for an individual name column (e.g. forename, surname) are generally structured as: \n",
"\n",
"- Null level \n",
"- Exact match \n",
"- Fuzzy match ([using metric of choice](comparators.md)) \n",
"- Else level\n",
"\n",
"The [comparison_template_library](#method-2-using-the-comparisontemplatelibrary) contains the [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function which gives this structure, with some pre-defined parameters, out-of-the-box."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_comparison_template_library import name_comparison\n",
"\n",
"first_name_comparison = name_comparison(\"first_name\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Gives a comparison structured as follows:\n",
"\n",
"```\n",
"Comparison: Date of birth\n",
"├─-- ComparisonLevel: Exact match\n",
"├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.95\n",
"├─-- ComparisonLevel: First Names with Jaro-Winkler similarity greater than 0.88\n",
"├─-- ComparisonLevel: All other\n",
"```\n",
"\n",
"Or, using `human_readable_description` to generate automatically from `first_name_comparison`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Names within jaro_winkler thresholds 0.95, 0.88 vs. anything else' of \"first_name\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
" - 'Exact match first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
" - 'Jaro_winkler_similarity >= 0.95' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.95\n",
" - 'Jaro_winkler_similarity >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"print(first_name_comparison.human_readable_description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The [name_comparison](../comparison_template_library.md#splink.comparison_template_library.NameComparisonBase) function also allowing flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comparison 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else' of \"surname\" and \"surname_dm\".\n",
"Similarity is assessed using the following ComparisonLevels:\n",
" - 'Null' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n",
" - 'Exact match surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
" - 'Exact match surname_dm' with SQL rule: \"surname_dm_l\" = \"surname_dm_r\"\n",
" - 'Levenshtein <= 2' with SQL rule: levenshtein(\"surname_l\", \"surname_r\") <= 2\n",
" - 'Jaccard >= 1' with SQL rule: jaccard(\"surname_l\", \"surname_r\") >= 1\n",
" - 'All other comparisons' with SQL rule: ELSE\n",
"\n"
]
}
],
"source": [
"surname_comparison = name_comparison(\n",
" \"surname\",\n",
" phonetic_col_name=\"surname_dm\",\n",
" term_frequency_adjustments_name=True,\n",
" levenshtein_thresholds=[2],\n",
" jaro_winkler_thresholds=[],\n",
" jaccard_thresholds=[1],\n",
")\n",
"print(surname_comparison.human_readable_description)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where `surname_dm` refers to a column which has used the DoubleMetaphone algorithm on `surname` to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](phonetic.md)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this as a specifications dictionary you can call"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'output_column_name': 'custom_surname_surname_dm',\n",
" 'comparison_levels': [{'sql_condition': '\"surname_l\" IS NULL OR \"surname_r\" IS NULL',\n",
" 'label_for_charts': 'Null',\n",
" 'is_null_level': True},\n",
" {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
" 'label_for_charts': 'Exact match surname',\n",
" 'tf_adjustment_column': 'surname',\n",
" 'tf_adjustment_weight': 1.0},\n",
" {'sql_condition': '\"surname_dm_l\" = \"surname_dm_r\"',\n",
" 'label_for_charts': 'Exact match surname_dm'},\n",
" {'sql_condition': 'levenshtein(\"surname_l\", \"surname_r\") <= 2',\n",
" 'label_for_charts': 'Levenshtein <= 2'},\n",
" {'sql_condition': 'jaccard(\"surname_l\", \"surname_r\") >= 1',\n",
" 'label_for_charts': 'Jaccard >= 1'},\n",
" {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
" 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Dates within levenshtein threshold 2 vs. Names within jaccard threshold 1 vs. anything else'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"surname_comparison.as_dict()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-4-providing-the-spec-as-a-dictionary), if desired."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
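
The notebook above lists the levels of the default date comparison, ending in an `ELSE` rule, which suggests the levels are checked in order and the first matching rule assigns the level. The following is a minimal pure-Python sketch of that idea, not Splink's implementation. The helper names `levenshtein` and `date_level` are hypothetical. It assumes dates are compared as ISO-formatted strings for the Levenshtein levels, and it approximates the `date_diff('year', ...)` rule by the difference in calendar years.

```python
from datetime import date
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def date_level(left: Optional[date], right: Optional[date]) -> str:
    """Return the label of the first level whose rule matches.

    Hypothetical helper: it mirrors the level order printed in the
    notebook (Null, exact, Levenshtein 1 and 2, 1 year, 10 years, else).
    """
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    # Assumption: dates are compared as ISO strings, e.g. "1990-01-01".
    distance = levenshtein(left.isoformat(), right.isoformat())
    if distance <= 1:
        return "Levenshtein <= 1"
    if distance <= 2:
        return "Levenshtein <= 2"
    # Simplification: compare the calendar-year parts only.
    year_gap = abs(left.year - right.year)
    if year_gap <= 1:
        return "Within 1 year"
    if year_gap <= 10:
        return "Within 10 years"
    return "All other comparisons"


if __name__ == "__main__":
    print(date_level(date(1990, 1, 1), date(1990, 1, 2)))
    print(date_level(date(1985, 3, 12), date(1993, 11, 27)))
```

Because the rules are checked in order, a pair that satisfies several rules lands in the earliest one. A single-digit typo falls into a Levenshtein level even though the dates are also within a year of each other, which is why the stricter levels are listed first.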