
Fix tf table init on settings load #1604

Closed
ThomasHepworth wants to merge 7 commits

Conversation

ThomasHepworth (Contributor)

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Addresses the issue outlined here - #802.

Give a brief description for the solution you have provided

Adds support for two new methods:

  • _queue_term_frequency_tables
  • _materialise_tf_tables

_materialise_tf_tables

This method is the simpler of the two. It calls compute_tf_table on a selection of input columns defined by the user or, if none is supplied, on all columns requiring term frequency adjustments as identified in the settings. It always materialises the resulting tables (as that is the behaviour implemented in compute_tf_table).
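A minimal sketch of that behaviour, purely for illustration - the method name and fallback rule come from this description, but the body and the helper tf_adjustment_columns_from_settings are assumptions, not the PR's actual implementation:

def _materialise_tf_tables(linker, input_columns=None):
    # Fall back to every column flagged for term frequency adjustments
    # in the settings when the user supplies no explicit selection.
    if input_columns is None:
        input_columns = tf_adjustment_columns_from_settings(linker)  # hypothetical helper
    # compute_tf_table always materialises its output, so each call
    # leaves a concrete tf table behind in the database.
    return [linker.compute_tf_table(col) for col in input_columns]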

_queue_term_frequency_tables

This also does what it says on the tin, queueing term frequency tables for future execution. I've set this to queue only those tables identified in the settings object, as I don't foresee any circumstances where we would want to expand this to tables outside the settings (our existing methods/functions all rely on tf tables found in concat_with_tf).

This method works both when:

  • No tf tables exist
  • One or more required tf tables exist

In the former case, we simply create the concat_with_tf table and then infer all of the required term frequencies from this.

In the latter, we queue up the previously generated term frequency tables (first_name_tf, for example) before queueing up any missing tables as SQL CTEs.

It's worth noting that this method can be made to work without the concat_with_tf materialisation step (if you look at my commit history, this was originally the case). I've opted to add the materialisation step in, as concat_with_tf is a core table used throughout the whole of Splink, and it also allows future runs of _queue_term_frequency_tables to execute more quickly.
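Roughly, the queueing logic described above amounts to something like the following sketch. The cache key format and the enqueue helpers are assumptions (the real code works through Splink's SQL pipeline):

def _queue_term_frequency_tables(linker):
    # tf columns are always taken from the settings object; tables
    # outside the settings are deliberately never queued.
    for col in tf_adjustment_columns_from_settings(linker):  # hypothetical helper
        cached = linker._intermediate_table_cache.get(f"__splink__df_tf_{col}")  # assumed cache key
        if cached is not None:
            # Re-queue a previously generated tf table (e.g. first_name_tf).
            enqueue_existing_table(linker, cached)  # hypothetical
        else:
            # Queue the missing table as a SQL CTE derived from the
            # materialised concat_with_tf table.
            enqueue_tf_sql(linker, col)  # hypothetical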

PR Checklist

I will update the docs tomorrow w/ the above change.

  • Added documentation for changes - I will update the docs with a few guides over the next week
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

ThomasHepworth linked an issue Oct 3, 2023 that may be closed by this pull request
RobinL self-assigned this Oct 11, 2023
RobinL (Member) commented Oct 11, 2023

I'm trying to write an example script to test this, and it occurs to me that it's not really possible to use this function how we might want, because it's not possible to instantiate the linker without any data.

I believe the issue the user faces is when they want to use compare_two_records without having done any other splink operations (like linker.predict()).

This use case seems most likely when the user is in some kind of live-linkage scenario (e.g. a digital service that just needs to score two records) and doesn't want to load in the potentially massive dataset with which they trained the model.

So how can we achieve this?

Suppose we have compare_two_records.json

You can't do this:

linker = DuckDBLinker([], "compare_two_records.json")

because:

TypeError: reduce() of empty iterable with no initial value
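For context, that message comes straight from Python's functools.reduce, which raises it whenever it is given an empty iterable and no initial value - presumably the linker reduces over the input tables somewhere during setup:

from functools import reduce

# Reproduces the error independently of Splink: reduce() over an
# empty iterable with no initialiser raises the same TypeError.
reduce(lambda left, right: left + right, [])
# TypeError: reduce() of empty iterable with no initial value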

You could provide arbitrary data, but this won't work as expected:

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

record_1 = {
    "unique_id": 0,
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "unique_id": 1,
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")

linker.compare_two_records(record_1, record_2).as_pandas_dataframe()

But now the term frequency tables are incorrect because they're derived from the fake data.

(screenshot of the resulting comparison output omitted)

The only way to do it correctly is to load in term frequency tables derived from the large dataset on which the model was trained.
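One possible direction (a sketch, not a verified workaround): precompute the lookup from the full dataset and register it over the top of the table derived from the fake data. This assumes Splink 3's register_table method, a __splink__df_tf_city internal table name, and a (city, tf_city) schema - all assumptions worth checking against what your settings actually generate:

import pandas as pd
from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")

# A tf lookup precomputed from the full training dataset, one row per
# distinct city with its relative frequency (file and schema assumed).
df_city_tf = pd.read_parquet("city_tf_lookup.parquet")  # hypothetical file

# Register it under the internal name Splink is assumed to expect for
# the city tf table, replacing the one derived from the fake data.
linker.register_table(df_city_tf, "__splink__df_tf_city", overwrite=True)

linker.compare_two_records(record_1, record_2).as_pandas_dataframe()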

i.e. the current error messages are actually 'correct' in a sense - they're not useful/informative/descriptive, but in this use case, we don't actually want an error.

I'm not sure what to do about this - grateful for your thoughts re: the above

RobinL (Member) commented Oct 11, 2023

Here are the scripts I used to try this out:

import pandas as pd

from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings, connection=":memory:")

linker.estimate_u_using_random_sampling(target_rows=1e6)


blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


linker.save_model_to_json("compare_two_records.json")

then

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

record_1 = {
    "unique_id": 0,
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "unique_id": 1,
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")
linker.debug_mode = True
linker.compare_two_records(record_1, record_2).as_pandas_dataframe()

RobinL (Member) commented Oct 11, 2023

In terms of how the API should work, I think it should be more like this:

Example 1: Should throw errors:

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")

record_1 = {
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker.compare_two_records(record_1, record_2)

Example 2: Should work (assuming the city column is the only one with tf adjustments):

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")

linker.register_term_frequency_lookup(df_city_tf_lookup_derived_from_full_dataset)

record_1 = {
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker.compare_two_records(record_1, record_2)
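A minimal sketch of what such a method might do under the hood, assuming it can be built on the existing register_table and that internal tf tables follow a __splink__df_tf_<column> naming convention (both assumptions, not confirmed by this PR). Note the example above passes only the lookup, whereas a column name argument, as sketched here, would likely be needed to target the right internal table:

def register_term_frequency_lookup(linker, df_lookup, col_name, overwrite=True):
    # Proposed API as a thin wrapper: store the user-supplied lookup
    # under the internal name assumed for that column's tf table.
    table_name = f"__splink__df_tf_{col_name}"
    return linker.register_table(df_lookup, table_name, overwrite=overwrite)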

RobinL (Member) commented Jan 8, 2024

Closing for now - it's a bit unclear what the best way forward on this one is.

Linked issue: compare_two_records needs to check whether tf tables exist