# [BUG] `settings_obj._source_dataset_col` and `settings_obj._source_dataset_input_column` #1711

Closed · RobinL opened this issue Nov 8, 2023 · 8 comments · Fixed by #1731

---

**RobinL commented Nov 8, 2023**

The following two properties do not return the expected types:

- `linker._settings_obj._source_dataset_col`
- `linker._settings_obj._source_dataset_input_column`

Related to this

Demo:

Script to illustrate the problem:

```python
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker([df, df], settings)

print(linker._settings_obj._source_dataset_col)
print(linker._settings_obj._source_dataset_input_column)
```

Returns:

```
print(linker._settings_obj._source_dataset_col)
> ('source_dataset', '"source_dataset"')

print(linker._settings_obj._source_dataset_input_column)
> source_dataset
```

It's unclear to me what `_source_dataset_col` should be doing, but `_source_dataset_input_column` should definitely return an `InputColumn` or be renamed.
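
For illustration, a minimal sketch of the typed-return direction suggested above. `InputColumn` here is a stand-in dataclass, not Splink's actual class, and the property body is an assumption about the fix, not the real implementation:

```python
# Hypothetical sketch: the property returns a typed column object rather
# than a bare string. InputColumn here is a stand-in, not Splink's class.
from dataclasses import dataclass


@dataclass
class InputColumn:
    raw_column_name: str


class SettingsSketch:
    _source_dataset_column_name = "source_dataset"

    @property
    def _source_dataset_input_column(self) -> InputColumn:
        return InputColumn(self._source_dataset_column_name)


print(SettingsSketch()._source_dataset_input_column)
# InputColumn(raw_column_name='source_dataset')
```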
---

**RobinL commented Nov 13, 2023**

The original problem was that Splink failed if the user ran a link-only job with the source dataset column already populated.

Let's recreate it:

```
git checkout a83bf7ab83e0a954dcff69bd6ef36779cf236b6b
```

**Demo script where the source dataset column is created by Splink (everything works):**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0]
df_right = df[df["side"] == 1]



linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)

linker.predict().as_pandas_dataframe(limit=2)
```

**Three-dataset case also works:**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    # "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 3

df_1 = df[df["side"] == 0]
# df_left["src_dataset"] = "left"

df_2 = df[df["side"] == 1]
# df_right["src_dataset"] = "right"

df_3 = df[df["side"] == 2]


linker = DuckDBLinker(
    [df_1, df_2, df_3], settings, input_table_aliases=["df_1", "df_2", "df_3"])
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

**Demo script where the source dataset column is manually set (predict() results in no output):**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0]
df_left["src_dataset"] = "left"

df_right = df[df["side"] == 1]
df_right["src_dataset"] = "right"




linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)

linker.predict().as_pandas_dataframe(limit=2)
```

---

**RobinL commented Nov 13, 2023**

Further investigation of the original bug.

Here's where the column gets created:

```

--------Creating table: __splink__df_concat--------

            select 'df_left' as source_dataset, "index", "unique_id", "first_name", "surname", "dob", "city", "email", "group", "side"
            
            from df_left
             UNION ALL 
            select 'df_right' as source_dataset, "index", "unique_id", "first_name", "surname", "dob", "city", "email", "group", "side"
            
            from df_right
```

Need to look at the behaviour in the link-only three-dataset case too.

---

**RobinL commented Nov 13, 2023**

`__splink__df_concat` just concats everything.

Note that it creates `select '_a' as source_dataset` irrespective of whether it's required.

It's only when you get to blocking that the source_dataset column gets dropped, if `source_dataset_column_name` is set.

---

**RobinL commented Nov 13, 2023**

More succinct testing script:

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_and_dedupe",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0].copy()  # .copy() avoids pandas SettingWithCopyWarning
df_left["src_dataset"] = "left"

df_right = df[df["side"] == 1].copy()
df_right["src_dataset"] = "right"


linker = DuckDBLinker([df_left, df_right], settings)
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

---

**RobinL commented Nov 13, 2023**

One definite bug before was in blocking, here:

```python

    if (
        linker._two_dataset_link_only
        and not linker._find_new_matches_mode
        and not linker._compare_two_records_mode
    ):
        source_dataset_col = linker._settings_obj._source_dataset_column_name
        # Need df_l to be the one with the lowest id to preserve the property
        # that the left dataset is the one with the lowest concatenated id
        keys = linker._input_tables_dict.keys()
        keys = list(sorted(keys))
        df_l = linker._input_tables_dict[keys[0]]
        df_r = linker._input_tables_dict[keys[1]]
        # The problem is that you need the value contained within the source
        # dataset column here.
        # Where we created it, that's fine,
        # but where it was user-provided, we don't know what the value is.
        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = '{df_l.templated_name}'
        """

---

**RobinL commented Nov 13, 2023**

Aha, that logic is only needed in the two-dataset case, where the above makes execution more efficient.

In the three+ dataset case it's implicit, and you don't need to know the values in the column:

```sql
select
    "l"."src_dataset" as "src_dataset_l", "r"."src_dataset" as "src_dataset_r",
    "l"."unique_id" as "unique_id_l", "r"."unique_id" as "unique_id_r",
    "l"."first_name" as "first_name_l", "r"."first_name" as "first_name_r",
    "l"."surname" as "surname_l", "r"."surname" as "surname_r",
    '0' as match_key
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
    on l.first_name = r.first_name
where l."src_dataset" || '-__-' || l."unique_id" < r."src_dataset" || '-__-' || r."unique_id"
    and l."src_dataset" != r."src_dataset"
```
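
As a quick check that the concatenated-key trick really does emit each cross-dataset pair exactly once (a throwaway DuckDB example, not Splink code):

```python
# Each cross-dataset pair passes the '<' test in only one direction, so
# no knowledge of the column's actual values is needed
import duckdb

con = duckdb.connect()
rows = con.execute("""
    select l.ds, l.id, r.ds, r.id
    from (values ('a', '1'), ('b', '1')) as l(ds, id),
         (values ('a', '1'), ('b', '1')) as r(ds, id)
    where l.ds || '-__-' || l.id < r.ds || '-__-' || r.id
      and l.ds != r.ds
""").fetchall()
print(rows)  # [('a', '1', 'b', '1')] - the mirror pair is excluded
```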
            

---

**RobinL commented Nov 13, 2023**

Plan:

**The fix:**

```python
    if (
        linker._two_dataset_link_only
        and not linker._find_new_matches_mode
        and not linker._compare_two_records_mode
    ):
        source_dataset_col = linker._settings_obj._source_dataset_column_name
        # Need df_l to be the one with the lowest id to preserve the property
        # that the left dataset is the one with the lowest concatenated id

        # No - that doesn't even work!!!

        keys = linker._input_tables_dict.keys()
        keys = list(sorted(keys))
        df_l = linker._input_tables_dict[keys[0]]
        df_r = linker._input_tables_dict[keys[1]]
        # The problem is that you need the value contained within the source
        # dataset column here.
        # Where we created it, that's fine,
        # but where it was user-provided, we don't know what the value is.

        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = (select min({source_dataset_col}) from __splink__df_concat_with_tf)
        """
        linker._enqueue_sql(sql, "__splink__df_concat_with_tf_left")

        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = (select max({source_dataset_col}) from __splink__df_concat_with_tf)
        """
        linker._enqueue_sql(sql, "__splink_df_concat_with_tf_right")

Remember afterwards to experiment with removing the creation of the source_dataset column in the case where a source_dataset_column_name has been provided.
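
For that follow-up, a rough sketch of the direction (hypothetical helper, not Splink's actual vertical-concatenation code): only synthesize the column from the table alias when the user hasn't supplied one.

```python
# Hypothetical sketch of the follow-up idea: skip aliasing the table name
# into a source_dataset column when the user already provides one
def concat_select_sql(table_name: str, columns: list[str],
                      user_provided_source_dataset_col: bool) -> str:
    cols = ", ".join(f'"{c}"' for c in columns)
    if not user_provided_source_dataset_col:
        # Splink-created case: synthesize the column from the table alias
        cols = f"'{table_name}' as source_dataset, " + cols
    return f"select {cols} from {table_name}"


print(concat_select_sql("df_left", ["src_dataset", "unique_id"], True))
# select "src_dataset", "unique_id" from df_left
```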

---

**RobinL commented Nov 13, 2023**

Note `_source_dataset_col` got added later than #1193; it's not present in #1193.

Aha, there was a follow-up:
#1204
