Fix tf table init on settings load #1604
Conversation
I'm trying to write an example script to test this, and it occurs to me that it's not really possible to use this function how we might want, because it's not possible to instantiate the linker without any data. I believe the issue the user faces is when they want to use compare_two_records without the original training data. This use case seems most likely when the user is in some kind of live-linkage scenario (e.g. a digital service that just needs to score two records) and doesn't want to load in the potentially massive dataset with which they trained the model. So how can we achieve this? Suppose we have compare_two_records.json. You can't do this:
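A minimal sketch of the pattern, assuming compare_two_records.json holds the settings saved from a trained model:

```python
from splink.duckdb.linker import DuckDBLinker

# Fails: there is no input data to build the linker from.
linker = DuckDBLinker(None, "compare_two_records.json")
```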
because the linker cannot be instantiated without any input data.
You could provide arbitrary data, but this won't work as expected:
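Something along these lines (a sketch; the placeholder record is made up):

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

# Tiny placeholder dataset supplied only to satisfy the constructor.
fake_df = pd.DataFrame([
    {"first_name": "Julia", "surname": "Taylor", "dob": "2015-07-31", "city": "London", "email": "hannah88@powers.com"},
])

linker = DuckDBLinker(fake_df, "compare_two_records.json")
```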
But now the term frequency tables are incorrect because they're derived from the fake data. The only way to do it correctly is to load in term frequency tables from the large dataset on which the model was trained. i.e. the current error messages are actually 'correct' in a sense - they're not useful/informative/descriptive, but in this use case we don't actually want an error. I'm not sure what to do about this - grateful for your thoughts re: the above
Here are the scripts I used to try this out:
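First, a sketch of training and saving a model on the full dataset (the file path, comparisons and elided training steps are illustrative, not the original script):

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

df = pd.read_csv("full_training_dataset.csv")  # illustrative path

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.exact_match("email"),
    ],
}

linker = DuckDBLinker(df, settings)
# ... model estimation/training steps elided ...
linker.save_settings_to_json("compare_two_records.json", overwrite=True)
```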
then
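a second sketch (again illustrative), scoring two records using only the saved settings:

```python
from splink.duckdb.linker import DuckDBLinker

record_1 = {"first_name": "Julia ", "surname": None, "dob": "2015-10-29", "city": "London", "email": "hannah88@powers.com"}
record_2 = {"first_name": "Julia ", "surname": "Taylor", "dob": "2015-07-31", "city": "London", "email": "hannah88@powers.com"}

# Fails today: there is no input data from which to derive the tf tables.
linker = DuckDBLinker(None, "compare_two_records.json")
linker.compare_two_records(record_1, record_2)
```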
In terms of how the API should work, I think it should be more like this:

Example 1: Should throw errors:

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")

record_1 = {"first_name": "Julia ", "surname": None, "dob": "2015-10-29", "city": "London", "email": "hannah88@powers.com", "cluster": 0}
record_2 = {"first_name": "Julia ", "surname": "Taylor", "dob": "2015-07-31", "city": "London", "email": "hannah88@powers.com", "cluster": 0}

linker.compare_two_records(record_1, record_2)
```

Example 2: Should work (assuming the city column is the only one with tf adjustments):

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(input_table_or_tables=None, settings_dict="compare_two_records.json")
linker.register_term_frequency_lookup(df_city_tf_lookup_derived_from_full_dataset)

record_1 = {"first_name": "Julia ", "surname": None, "dob": "2015-10-29", "city": "London", "email": "hannah88@powers.com", "cluster": 0}
record_2 = {"first_name": "Julia ", "surname": "Taylor", "dob": "2015-07-31", "city": "London", "email": "hannah88@powers.com", "cluster": 0}

linker.compare_two_records(record_1, record_2)
```
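For context, a sketch of what df_city_tf_lookup_derived_from_full_dataset might look like, assuming splink's usual layout of one row per distinct value with a tf_<column> proportion (the values here are made up):

```python
import pandas as pd

# Hypothetical lookup derived offline from the full training dataset.
df_city_tf_lookup_derived_from_full_dataset = pd.DataFrame({
    "city": ["London", "Leeds", "York"],
    "tf_city": [0.25, 0.10, 0.02],
})
```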
Closing for now; a bit unclear what the best way forward on this one is.
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Addresses the issue outlined here - #802.
Give a brief description for the solution you have provided
Adds support for two new methods:
- _queue_term_frequency_tables
- _materialise_tf_tables

_materialise_tf_tables
This method is the simpler of the two, calling compute_tf_table on a selection of input columns defined by the user, or alternatively on all columns requiring term frequency adjustments identified in the settings. This function always materialises the table (as this is the functionality implemented in compute_tf_table).
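A rough sketch of the behaviour described, expressed through the existing public compute_tf_table method (the function name and signature here are illustrative, not splink's actual internals):

```python
def materialise_tf_tables(linker, settings, input_columns=None):
    """Sketch only: materialise tf tables for the given columns, or for
    every column flagged for tf adjustments in the settings dict."""
    if input_columns is None:
        # Keys follow splink's settings schema: each comparison level may
        # name a tf_adjustment_column.
        input_columns = sorted({
            level["tf_adjustment_column"]
            for comparison in settings["comparisons"]
            for level in comparison["comparison_levels"]
            if level.get("tf_adjustment_column")
        })
    for col in input_columns:
        # compute_tf_table always materialises the resulting table.
        linker.compute_tf_table(col)
```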
_queue_term_frequency_tables

This also does what it says on the tin, queueing term frequency tables for future execution. I've set this to only queue those tables identified in the settings object, as I don't foresee any circumstances where we would want to expand this to tables outside of the settings (our existing methods/functions all rely on the tf tables found in concat_with_tf).

This method works both when no term frequency tables have yet been created and when some have previously been generated. In the former case, we simply create the concat_with_tf table and then infer all of the required term frequencies from it. In the latter, we queue up the previously generated term frequency tables (first_name_tf, for example) before queueing up any missing tables as SQL CTEs.
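As a runnable illustration of what is being derived: a term frequency table holds, for each distinct value, its share of all non-null values in the column. Outside of splink, in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "London", "Leeds", None]})

# Each distinct value's proportion of the non-null values - the quantity
# a table such as city_tf holds.
city_tf = (
    df["city"]
    .value_counts(normalize=True)
    .rename("tf_city")
    .rename_axis("city")
    .reset_index()
)
print(city_tf)
#      city   tf_city
# 0  London  0.666667
# 1   Leeds  0.333333
```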
It's worth noting that this method can be made to work without the concat_with_tf materialisation step (if you look at my commit history, this was originally the case). I've opted to add the materialisation step because concat_with_tf is a core table used throughout the whole of splink, and it also allows future runs of _queue_term_frequency_tables to execute more quickly.

PR Checklist
I will update the docs tomorrow w/ the above change.