Refactor and simplify how TF adjustments are made in `_find_new_matches_mode` and `_compare_two_records_mode` #2111
Conversation
```python
        left_joins.append(sql)
    elif "__splink__df_concat_with_tf" in cache:
        subquery = f"""
        select distinct {col.name}, {col.tf_name}
```
I've moved the logic that derives tf columns from `__splink__df_concat_with_tf` to here. This logic is no longer used in `linker.compute_tf_table`, because it's rare to need to 'backwards induce' tf tables in this way, and it adds complexity for little computational gain (tf tables are fast to compute).
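To make the 'backwards induce' idea concrete, this is roughly the shape of the SQL involved (illustrative only, using `first_name` as a hypothetical tf column; the real statement is built dynamically from the column objects):

```python
# Illustrative sketch, not the exact SQL Splink generates: a tf table can be
# recovered from the cached __splink__df_concat_with_tf by taking the distinct
# (value, tf_value) pairs, rather than recomputing it from the raw input.
col_name, tf_col_name = "first_name", "tf_first_name"  # hypothetical column
sql = f"""
select distinct {col_name}, {tf_col_name}
from __splink__df_concat_with_tf
"""
```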
splink/term_frequencies.py (Outdated)
```diff
@@ -75,14 +75,58 @@ def _join_tf_to_input_df_sql(linker: Linker):
     return sql


-def term_frequencies_from_concat_with_tf(input_column):
+def _join_new_table_to_df_concat_with_tf_sql(linker: Linker, new_tablename) -> str:
```
Previously, we were reusing the logic in `_join_tf_to_input_df_sql` by performing string replacement operations on its output. This was confusing. These two functions have also diverged further with this PR, so it's no longer easy to share a single function.
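As a deliberately simplified, hypothetical contrast between the two approaches (the function bodies, table names and join key below are invented for illustration; the real functions live in splink/term_frequencies.py):

```python
# Toy illustration of the design change (not Splink's actual code): patching
# generated SQL with string replacement vs. building it for the new table directly.

def join_tf_to_input_df_sql(col: str) -> str:
    # Builds the tf join for the standard concatenated input table.
    return (
        f"select df.*, tf.tf_{col} from __splink__df_concat as df "
        f"left join __splink__df_tf_{col} as tf on df.{col} = tf.{col}"
    )

# Before: reuse the above, then patch the table name afterwards. This breaks
# silently if the shape of the generated SQL changes.
patched_sql = join_tf_to_input_df_sql("first_name").replace(
    "__splink__df_concat", "__splink__df_new_records"
)

# After: a dedicated builder takes the target table name as a parameter.
def join_new_table_to_df_concat_with_tf_sql(new_tablename: str, col: str) -> str:
    return (
        f"select df.*, tf.tf_{col} from {new_tablename} as df "
        f"left join __splink__df_tf_{col} as tf on df.{col} = tf.{col}"
    )

direct_sql = join_new_table_to_df_concat_with_tf_sql("__splink__df_new_records", "first_name")
```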
```diff
@@ -127,6 +127,7 @@ def enqueue_df_concat(linker: Linker, pipeline: CTEPipeline) -> CTEPipeline:
     # so if it exists, use it instead
     elif "__splink__df_concat_with_tf" in cache:
         nodes_with_tf = cache.get_with_logging("__splink__df_concat_with_tf")
+        nodes_with_tf.templated_name = "__splink__df_concat"
```
This is a separate bug I found while doing this PR: when the cached `__splink__df_concat_with_tf` is reused in place of `__splink__df_concat`, its templated name needs to be overridden so that downstream SQL referring to `__splink__df_concat` still resolves.
All seems sensible - looks good to me 👍
This is the first step in a PR to remove the need for:

- `linker._find_new_matches_mode`
- `linker._compare_two_records_mode`
Once I began working on this, I realised that to remove these flags, we first need to simplify the tf logic.
It also resolves a longstanding bug with `linker.compare_two_records` whereby it would only work if term frequency tables were present; #802 is therefore solved in Splink 4.
What does this do?
In `_find_new_matches_mode` and `_compare_two_records_mode`, new records are submitted by the user. Suppose term frequency adjustments are requested for the `first_name` column. We then need `tf_first_name` for these records. If a term frequency table (`__splink__df_tf_first_name`) is registered, we can simply left join the table of new records to it, as sketched below. If neither a registered tf table nor `__splink__df_concat_with_tf` is available, we simply create the table with no tf adjustments.
What do we have to be careful about?
The `compare_two_records` function is more fiddly than it looks because... (see this comment).
This PR makes it possible to use a trained model without needing to provide a big input dataset and compute `__splink__df_concat_with_tf`. This makes real-time scoring easier: it would be possible to do something like this in a live API:
Other notes
(You now get a warning if the tf table doesn't exist. You still get prediction results, just without any tf adjustments.)
A previous attempt at this was made here:
#1604