# [BUG] `settings_obj._source_dataset_col` and `settings_obj._source_dataset_input_column` #1711

Closed · RobinL opened this issue Nov 8, 2023 · 8 comments · Fixed by #1731

---

**RobinL commented Nov 8, 2023**

The following two properties do not return the expected types:

- `linker._settings_obj._source_dataset_col`
- `linker._settings_obj._source_dataset_input_column`

Related to this

Demo:

Script to illustrate the problem:

```python
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker([df, df], settings)

print(linker._settings_obj._source_dataset_col)
print(linker._settings_obj._source_dataset_input_column)
```

Returns:

```
print(linker._settings_obj._source_dataset_col)
> ('source_dataset', '"source_dataset"')

print(linker._settings_obj._source_dataset_input_column)
> source_dataset
```

It's unclear to me what `_source_dataset_col` should be doing, but `_source_dataset_input_column` should definitely return an `InputColumn` or be renamed.
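
For illustration, a minimal sketch of the typed-return direction suggested above. `InputColumn` here is a stand-in dataclass, not Splink's actual class, and the property body is an assumption about the fix, not the real implementation:

```python
# Hypothetical sketch: the property returns a typed column object rather
# than a bare string. InputColumn here is a stand-in, not Splink's class.
from dataclasses import dataclass


@dataclass
class InputColumn:
    raw_column_name: str


class SettingsSketch:
    _source_dataset_column_name = "source_dataset"

    @property
    def _source_dataset_input_column(self) -> InputColumn:
        return InputColumn(self._source_dataset_column_name)


print(SettingsSketch()._source_dataset_input_column)
# InputColumn(raw_column_name='source_dataset')
```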
---

**RobinL commented Nov 13, 2023**

The original problem was that Splink failed if the user ran a link-only job with the source dataset column already populated.

Let's recreate it:

```
git checkout a83bf7ab83e0a954dcff69bd6ef36779cf236b6b
```

**Demo script where the source dataset column is created by Splink (everything works):**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0]
df_right = df[df["side"] == 1]



linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)

linker.predict().as_pandas_dataframe(limit=2)
```

**Three-dataset case also works:**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    # "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 3

df_1 = df[df["side"] == 0]
# df_left["src_dataset"] = "left"

df_2 = df[df["side"] == 1]
# df_right["src_dataset"] = "right"

df_3 = df[df["side"] == 2]


linker = DuckDBLinker(
    [df_1, df_2, df_3], settings, input_table_aliases=["df_1", "df_2", "df_3"])
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

**Demo script where the source dataset column is manually set (predict() results in no output):**

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0]
df_left["src_dataset"] = "left"

df_right = df[df["side"] == 1]
df_right["src_dataset"] = "right"




linker = DuckDBLinker(
    [df_left, df_right], settings, input_table_aliases=["df_left", "df_right"]
)

linker.predict().as_pandas_dataframe(limit=2)
```

---

**RobinL commented Nov 13, 2023**

Further investigation of the original bug.

Here's where the column gets created:

```

--------Creating table: __splink__df_concat--------

            select 'df_left' as source_dataset, "index", "unique_id", "first_name", "surname", "dob", "city", "email", "group", "side"
            
            from df_left
             UNION ALL 
            select 'df_right' as source_dataset, "index", "unique_id", "first_name", "surname", "dob", "city", "email", "group", "side"
            
            from df_right
```

Need to look at the behaviour in the link-only three-dataset case too.

---

**RobinL commented Nov 13, 2023**

`__splink__df_concat` just concats everything.

Note that it creates `select '_a' as source_dataset` irrespective of whether it's required.

It's only when you get to blocking that the source_dataset column gets dropped, if `source_dataset_column_name` is set.

---

**RobinL commented Nov 13, 2023**

More succinct testing script:

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)

import pandas as pd


settings = {
    "probability_two_random_records_match": 0.01,
    "source_dataset_column_name": "src_dataset",
    "link_type": "link_and_dedupe",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
}


df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

df = df.reset_index()
df["side"] = df.index % 2

df_left = df[df["side"] == 0].copy()  # .copy() avoids pandas SettingWithCopyWarning
df_left["src_dataset"] = "left"

df_right = df[df["side"] == 1].copy()
df_right["src_dataset"] = "right"


linker = DuckDBLinker([df_left, df_right], settings)
linker.debug_mode = True
linker.predict().as_pandas_dataframe(limit=2)
```

---

**RobinL commented Nov 13, 2023**

One definite bug before was in blocking, here:

```python

    if (
        linker._two_dataset_link_only
        and not linker._find_new_matches_mode
        and not linker._compare_two_records_mode
    ):
        source_dataset_col = linker._settings_obj._source_dataset_column_name
        # Need df_l to be the one with the lowest id to preserve the property
        # that the left dataset is the one with the lowest concatenated id
        keys = linker._input_tables_dict.keys()
        keys = list(sorted(keys))
        df_l = linker._input_tables_dict[keys[0]]
        df_r = linker._input_tables_dict[keys[1]]
        # The problem is that you need the value contained within the source
        # dataset column here.
        # Where we created it, that's fine,
        # but where it was user-provided, we don't know what the value is.
        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = '{df_l.templated_name}'
        """

---

**RobinL commented Nov 13, 2023**

Aha, that logic is only needed in the two-dataset case, where the above makes execution more efficient.

In the three+ dataset case it's implicit, and you don't need to know the values in the column:

```sql
select
    "l"."src_dataset" as "src_dataset_l", "r"."src_dataset" as "src_dataset_r",
    "l"."unique_id" as "unique_id_l", "r"."unique_id" as "unique_id_r",
    "l"."first_name" as "first_name_l", "r"."first_name" as "first_name_r",
    "l"."surname" as "surname_l", "r"."surname" as "surname_r",
    '0' as match_key
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
    on l.first_name = r.first_name
where l."src_dataset" || '-__-' || l."unique_id" < r."src_dataset" || '-__-' || r."unique_id"
    and l."src_dataset" != r."src_dataset"
```
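
As a quick check that the concatenated-key trick really does emit each cross-dataset pair exactly once (a throwaway DuckDB example, not Splink code):

```python
# Each cross-dataset pair passes the '<' test in only one direction, so
# no knowledge of the column's actual values is needed
import duckdb

con = duckdb.connect()
rows = con.execute("""
    select l.ds, l.id, r.ds, r.id
    from (values ('a', '1'), ('b', '1')) as l(ds, id),
         (values ('a', '1'), ('b', '1')) as r(ds, id)
    where l.ds || '-__-' || l.id < r.ds || '-__-' || r.id
      and l.ds != r.ds
""").fetchall()
print(rows)  # [('a', '1', 'b', '1')] - the mirror pair is excluded
```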
            

---

**RobinL commented Nov 13, 2023**

Plan:

**The fix:**

```python
    if (
        linker._two_dataset_link_only
        and not linker._find_new_matches_mode
        and not linker._compare_two_records_mode
    ):
        source_dataset_col = linker._settings_obj._source_dataset_column_name
        # Need df_l to be the one with the lowest id to preserve the property
        # that the left dataset is the one with the lowest concatenated id

        # No - that doesn't even work!!!

        keys = linker._input_tables_dict.keys()
        keys = list(sorted(keys))
        df_l = linker._input_tables_dict[keys[0]]
        df_r = linker._input_tables_dict[keys[1]]
        # The problem is that you need the value contained within the source
        # dataset column here.
        # Where we created it, that's fine,
        # but where it was user-provided, we don't know what the value is.

        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = (select min({source_dataset_col}) from __splink__df_concat_with_tf)
        """
        linker._enqueue_sql(sql, "__splink__df_concat_with_tf_left")

        sql = f"""
        select * from __splink__df_concat_with_tf
        where {source_dataset_col} = (select max({source_dataset_col}) from __splink__df_concat_with_tf)
        """
        linker._enqueue_sql(sql, "__splink_df_concat_with_tf_right")

Remember afterwards to experiment with removing the creation of the source_dataset column in the case where a source_dataset_column_name has been provided.
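
For that follow-up, a rough sketch of the direction (hypothetical helper, not Splink's actual vertical-concatenation code): only synthesize the column from the table alias when the user hasn't supplied one.

```python
# Hypothetical sketch of the follow-up idea: skip aliasing the table name
# into a source_dataset column when the user already provides one
def concat_select_sql(table_name: str, columns: list[str],
                      user_provided_source_dataset_col: bool) -> str:
    cols = ", ".join(f'"{c}"' for c in columns)
    if not user_provided_source_dataset_col:
        # Splink-created case: synthesize the column from the table alias
        cols = f"'{table_name}' as source_dataset, " + cols
    return f"select {cols} from {table_name}"


print(concat_select_sql("df_left", ["src_dataset", "unique_id"], True))
# select "src_dataset", "unique_id" from df_left
```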

---

**RobinL commented Nov 13, 2023**

Note `_source_dataset_col` got added later than #1193; it's not present in #1193.

Aha, there was a follow-up:
#1204
