# [BUG] `settings_obj._source_dataset_col` and `settings_obj._source_dataset_input_column` #1711
The original problem was that Splink failed if the user ran a link only job with the source dataset columns already populated. Let's recreate it.

Demo script where the source dataset column is created by Splink - everything works:

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
import pandas as pd

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df = df.reset_index()
df["side"] = df.index % 2
df_left = df[df["side"] == 0]
df_right = df[df["side"] == 1]

linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)
linker.predict().as_pandas_dataframe(limit=2)
```

Three datasets works:
```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
import pandas as pd

settings = {
    "probability_two_random_records_match": 0.01,
    # "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df = df.reset_index()
df["side"] = df.index % 3
df_1 = df[df["side"] == 0]
# df_left["src_dataset"] = "left"
df_2 = df[df["side"] == 1]
# df_right["src_dataset"] = "right"
df_3 = df[df["side"] == 2]

linker = DuckDBLinker(
    [df_1, df_2, df_3], settings, input_table_aliases=["df_1", "df_2", "df_3"]
)
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

Demo script where the source dataset column is manually set - `predict()` results in no output:
```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
import pandas as pd

settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df = df.reset_index()
df["side"] = df.index % 2
df_left = df[df["side"] == 0].copy()  # .copy() so assignment below doesn't hit a slice
df_left["src_dataset"] = "left"
df_right = df[df["side"] == 1].copy()
df_right["src_dataset"] = "right"

linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)
linker.predict().as_pandas_dataframe(limit=2)
```

---
Further investigation of the original bug. Here's where the col gets created.

Need to look at the behaviour in link only with three datasets too.

---

Note that it creates … It's only when you get to blocking that the …

---
More succinct testing script:

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)
import pandas as pd

settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_and_dedupe",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
}

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df = df.reset_index()
df["side"] = df.index % 2
df_left = df[df["side"] == 0].copy()  # .copy() so assignment below doesn't hit a slice
df_left["src_dataset"] = "left"
df_right = df[df["side"] == 1].copy()
df_right["src_dataset"] = "right"

linker = DuckDBLinker([df_left, df_right], settings)
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

---
One definite bug before was in blocking here:

```python
if (
    linker._two_dataset_link_only
    and not linker._find_new_matches_mode
    and not linker._compare_two_records_mode
):
    source_dataset_col = linker._settings_obj._source_dataset_column_name
    # Need df_l to be the one with the lowest id to preserve the property
    # that the left dataset is the one with the lowest concatenated id
    keys = linker._input_tables_dict.keys()
    keys = list(sorted(keys))
    df_l = linker._input_tables_dict[keys[0]]
    df_r = linker._input_tables_dict[keys[1]]
    # The problem is that you need the value contained within the source
    # dataset column here.
    # Where we created it, that's fine,
    # but where it was user-provided, we don't know what the value is.
    sql = f"""
    select * from __splink__df_concat_with_tf
    where {source_dataset_col} = '{df_l.templated_name}'
    """
```

---
Aha - that logic is only needed in the two dataset case, where the above makes execution more efficient. In the three+ dataset case it's implicit and you don't need to know the values in the column.

---
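The point above can be illustrated with a hedged sketch (stdlib sqlite3, not Splink's actual blocking SQL; names are invented): with three or more datasets, an inequality join on the source dataset column pairs records across datasets without the engine ever needing to know which values the column contains.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table concat_tf (src_dataset text, uid int, first_name text)")
con.executemany(
    "insert into concat_tf values (?, ?, ?)",
    [("a", 1, "ann"), ("b", 2, "ann"), ("c", 3, "ann")],
)

# The inequality yields each cross-dataset pair exactly once, whatever the
# actual labels are -- no split into explicit left/right tables is needed
pairs = con.execute(
    """
    select l.uid, r.uid from concat_tf l, concat_tf r
    where l.src_dataset < r.src_dataset
      and l.first_name = r.first_name
    """
).fetchall()
print(sorted(pairs))  # [(1, 2), (1, 3), (2, 3)]
```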
Plan - the fix:

```python
if (
    linker._two_dataset_link_only
    and not linker._find_new_matches_mode
    and not linker._compare_two_records_mode
):
    source_dataset_col = linker._settings_obj._source_dataset_column_name
    # Need df_l to be the one with the lowest id to preserve the property
    # that the left dataset is the one with the lowest concatenated id
    # No - that doesn't even work!!!
    keys = linker._input_tables_dict.keys()
    keys = list(sorted(keys))
    df_l = linker._input_tables_dict[keys[0]]
    df_r = linker._input_tables_dict[keys[1]]
    # The problem is that you need the value contained within the source
    # dataset column here.
    # Where we created it, that's fine,
    # but where it was user-provided, we don't know what the value is.
    sql = f"""
    select * from __splink__df_concat_with_tf
    where {source_dataset_col} = (select min({source_dataset_col}) from __splink__df_concat_with_tf)
    """
    linker._enqueue_sql(sql, "__splink__df_concat_with_tf_left")
    sql = f"""
    select * from __splink__df_concat_with_tf
    where {source_dataset_col} = (select max({source_dataset_col}) from __splink__df_concat_with_tf)
    """
    linker._enqueue_sql(sql, "__splink__df_concat_with_tf_right")
```

Remember afterwards to experiment with removing the creation of the …

---
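The `min()`/`max()` approach can be sanity-checked in isolation. This is a hypothetical sqlite3 sketch (table and column names are illustrative, not Splink's): the two dataset labels are recovered from the data itself, so the left/right split works whether Splink created the column or the user supplied it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table concat_tf (src_dataset text, first_name text)")
con.executemany(
    "insert into concat_tf values (?, ?)",
    [("left", "ann"), ("right", "ann"), ("right", "bob")],
)

# min()/max() over the column recover the two labels from the data, so no
# assumption about what values the user chose is needed
left = con.execute(
    "select * from concat_tf where src_dataset ="
    " (select min(src_dataset) from concat_tf)"
).fetchall()
right = con.execute(
    "select * from concat_tf where src_dataset ="
    " (select max(src_dataset) from concat_tf)"
).fetchall()
print(sorted(left))   # [('left', 'ann')]
print(sorted(right))  # [('right', 'ann'), ('right', 'bob')]
```

Note this relies on the two-dataset case having exactly two distinct values in the column, which `link_only` with two inputs guarantees.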
Note `_source_dataset_col` got added later than 1193 - it's not present in 1193. Aha - there was a follow-up.

---
The following two properties do not return expected types: …

Related to this …

Demo: script to illustrate problem …

Returns: …