
Fix tf table init on settings load #1604

Closed
ThomasHepworth wants to merge 7 commits

Conversation

ThomasHepworth (Contributor)

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Addresses the issue outlined here - #802.

Give a brief description for the solution you have provided

Adds support for two new methods:

  • _queue_term_frequency_tables
  • _materialise_tf_tables

_materialise_tf_tables

This method is the simpler of the two. It calls compute_tf_table on a selection of input columns defined by the user or, if none is supplied, on all columns requiring term frequency adjustments as identified in the settings. It always materialises the resulting tables (as that is the behaviour implemented in compute_tf_table).
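A minimal sketch of that behaviour, purely for illustration - the method name and fallback rule come from this description, but the body and the helper tf_adjustment_columns_from_settings are assumptions, not the PR's actual implementation:

def _materialise_tf_tables(linker, input_columns=None):
    # Fall back to every column flagged for term frequency adjustments
    # in the settings when the user supplies no explicit selection.
    if input_columns is None:
        input_columns = tf_adjustment_columns_from_settings(linker)  # hypothetical helper
    # compute_tf_table always materialises its output, so each call
    # leaves a concrete tf table behind in the database.
    return [linker.compute_tf_table(col) for col in input_columns]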

_queue_term_frequency_tables

This also does what it says on the tin, queueing term frequency tables for future execution. I've set this to queue only those tables identified in the settings object, as I don't foresee any circumstances where we would want to expand this to tables outside the settings (our existing methods/functions all rely on tf tables found in concat_with_tf).

This method works both when:

  • No tf tables exist
  • One or more required tf tables exist

In the former case, we simply create the concat_with_tf table and then infer all of the required term frequencies from this.

In the latter, we queue up the previously generated term frequency tables (first_name_tf, for example) before queueing up any missing tables as SQL CTEs.

It's worth noting that this method can be made to work without the concat_with_tf materialisation step (if you look at my commit history, this was originally the case). I've opted to add the materialisation step in, as concat_with_tf is a core table used throughout the whole of Splink, and it also allows future runs of _queue_term_frequency_tables to execute more quickly.
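Roughly, the queueing logic described above amounts to something like the following sketch. The cache key format and the enqueue helpers are assumptions (the real code works through Splink's SQL pipeline):

def _queue_term_frequency_tables(linker):
    # tf columns are always taken from the settings object; tables
    # outside the settings are deliberately never queued.
    for col in tf_adjustment_columns_from_settings(linker):  # hypothetical helper
        cached = linker._intermediate_table_cache.get(f"__splink__df_tf_{col}")  # assumed cache key
        if cached is not None:
            # Re-queue a previously generated tf table (e.g. first_name_tf).
            enqueue_existing_table(linker, cached)  # hypothetical
        else:
            # Queue the missing table as a SQL CTE derived from the
            # materialised concat_with_tf table.
            enqueue_tf_sql(linker, col)  # hypothetical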

PR Checklist

I will update the docs tomorrow w/ the above change.

  • Added documentation for changes - I will update the docs with a few guides over the next week
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

ThomasHepworth linked an issue Oct 3, 2023 that may be closed by this pull request
RobinL self-assigned this Oct 11, 2023
RobinL (Member) commented Oct 11, 2023

I'm trying to write an example script to test this, and it occurs to me that it's not really possible to use this function how we might want, because it's not possible to instantiate the linker without any data.

I believe the issue the user faces is when they want to use compare_two_records without having done any other splink operations (like linker.predict()).

This use case seems most likely when the user is in some kind of live-linkage scenario (e.g. a digital service that just needs to score two records) and doesn't want to load in the potentially massive dataset with which they trained the model.

So how can we achieve this?

Suppose we have compare_two_records.json

You can't do this:

linker = DuckDBLinker([], "compare_two_records.json")

because:

TypeError: reduce() of empty iterable with no initial value
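For context, that message comes straight from Python's functools.reduce, which raises it whenever it is given an empty iterable and no initial value - presumably the linker reduces over the input tables somewhere during setup:

from functools import reduce

# Reproduces the error independently of Splink: reduce() over an
# empty iterable with no initialiser raises the same TypeError.
reduce(lambda left, right: left + right, [])
# TypeError: reduce() of empty iterable with no initial value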

You could provide arbitrary data, but this won't work as expected:

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

record_1 = {
    "unique_id": 0,
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "unique_id": 1,
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")

linker.compare_two_records(record_1, record_2).as_pandas_dataframe()

But now the term frequency tables are incorrect because they're derived from the fake data.

(screenshot of the resulting comparison output omitted)

The only way to do it correctly is to load in term frequency tables derived from the large dataset on which the model was trained.
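One possible direction (a sketch, not a verified workaround): precompute the lookup from the full dataset and register it over the top of the table derived from the fake data. This assumes Splink 3's register_table method, a __splink__df_tf_city internal table name, and a (city, tf_city) schema - all assumptions worth checking against what your settings actually generate:

import pandas as pd
from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")

# A tf lookup precomputed from the full training dataset, one row per
# distinct city with its relative frequency (file and schema assumed).
df_city_tf = pd.read_parquet("city_tf_lookup.parquet")  # hypothetical file

# Register it under the internal name Splink is assumed to expect for
# the city tf table, replacing the one derived from the fake data.
linker.register_table(df_city_tf, "__splink__df_tf_city", overwrite=True)

linker.compare_two_records(record_1, record_2).as_pandas_dataframe()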

i.e. the current error messages are actually 'correct' in a sense - they're not useful/informative/descriptive, but in this use case, we don't actually want an error.

I'm not sure what to do about this - grateful for your thoughts re: the above

RobinL (Member) commented Oct 11, 2023

Here are the scripts I used to try this out:

import pandas as pd

from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings, connection=":memory:")

linker.estimate_u_using_random_sampling(target_rows=1e6)


blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


linker.save_model_to_json("compare_two_records.json")

then

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

record_1 = {
    "unique_id": 0,
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "unique_id": 1,
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker = DuckDBLinker(pd.DataFrame([record_1]), "compare_two_records.json")
linker.debug_mode = True
linker.compare_two_records(record_1, record_2).as_pandas_dataframe()

RobinL (Member) commented Oct 11, 2023

In terms of how the API should work, I think it should be more like this:

Example 1: Should throw errors:

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")

record_1 = {
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker.compare_two_records(record_1, record_2)

Example 2: Should work (assuming the city column is the only one with tf adjustments):

import pandas as pd

from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")

linker.register_term_frequency_lookup(df_city_tf_lookup_derived_from_full_dataset)

record_1 = {
    "first_name": "Julia ",
    "surname": None,
    "dob": "2015-10-29",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

record_2 = {
    "first_name": "Julia ",
    "surname": "Taylor",
    "dob": "2015-07-31",
    "city": "London",
    "email": "hannah88@powers.com",
    "cluster": 0,
}

linker.compare_two_records(record_1, record_2)
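A minimal sketch of what such a method might do under the hood, assuming it can be built on the existing register_table and that internal tf tables follow a __splink__df_tf_<column> naming convention (both assumptions, not confirmed by this PR). Note the example above passes only the lookup, whereas a column name argument, as sketched here, would likely be needed to target the right internal table:

def register_term_frequency_lookup(linker, df_lookup, col_name, overwrite=True):
    # Proposed API as a thin wrapper: store the user-supplied lookup
    # under the internal name assumed for that column's tf table.
    table_name = f"__splink__df_tf_{col_name}"
    return linker.register_table(df_lookup, table_name, overwrite=overwrite)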

RobinL (Member) commented Jan 8, 2024

Closing for now - it's a bit unclear what the best way forward on this one is.

Linked issue: compare_two_records needs to check whether tf tables exist