Refactor and simplify how TF adjustments are made in _find_new_matches_mode and _compare_two_records_mode
#2111
This is the first step in a PR to remove the need for:
linker._find_new_matches_mode
linker._compare_two_records_mode
Once I began working on this, I realised that to remove these flags, we first need to simplify the tf logic.
It also resolves a longstanding bug with linker.compare_two_records, whereby it would only work if term frequency tables were present (see here). #802 is therefore solved in Splink 4.
What does this do?
In _find_new_matches_mode and _compare_two_records_mode, new records are submitted by the user. Suppose term frequency adjustments are requested for the first_name column.
We then need tf_first_name for these records.
If a term frequency lookup table (__splink__df_tf_first_name) is registered, then we can simply left join the table of new records to it.
If __splink__df_concat_with_tf is not available, we simply create the table with no adjustments.
What do we have to be careful about?
The compare_two_records function is more fiddly than it looks because... (see this comment)
This PR makes it possible to use a trained model without needing to provide a big input dataset and computing __splink__df_concat_with_tf.
This makes real-time scoring easier: it would be possible to do this in a live API.
Other notes
You now get a warning if the tf table doesn't exist. You still get prediction results, just without any tf adjustments.
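To illustrate the lookup-or-fallback behaviour described in this PR, here is a minimal sketch. It is not the code in this PR: it uses the standard-library sqlite3 module as a stand-in backend, and the function name add_tf_column and the new_records table are hypothetical. Only the __splink__df_tf_first_name table name comes from the description above.

```python
import sqlite3
import warnings


def add_tf_column(con, records_table, column):
    """Return the rows of records_table with a tf_<column> value appended.

    Hypothetical helper for illustration only (not Splink's implementation).
    If a lookup table named __splink__df_tf_<column> exists, left join to it.
    Otherwise warn and fill tf_<column> with NULL, i.e. no tf adjustment.
    """
    tf_table = f"__splink__df_tf_{column}"
    exists = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (tf_table,),
    ).fetchone()

    if exists:
        # Records whose value is missing from the lookup get NULL via the left join
        sql = f"""
            SELECT r.*, tf.tf_{column}
            FROM {records_table} AS r
            LEFT JOIN {tf_table} AS tf
            ON r.{column} = tf.{column}
        """
    else:
        warnings.warn(
            f"No {tf_table} table found; no tf adjustment applied to {column}"
        )
        sql = f"SELECT r.*, NULL AS tf_{column} FROM {records_table} AS r"
    return con.execute(sql).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE new_records (unique_id INTEGER, first_name TEXT)")
con.executemany(
    "INSERT INTO new_records VALUES (?, ?)", [(1, "john"), (2, "zelda")]
)

# Case 1: no tf lookup registered, so we get a warning and NULL tf values
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    without_lookup = add_tf_column(con, "new_records", "first_name")

# Case 2: a tf lookup is registered, so the new records are left joined to it
con.execute(
    "CREATE TABLE __splink__df_tf_first_name (first_name TEXT, tf_first_name REAL)"
)
con.execute("INSERT INTO __splink__df_tf_first_name VALUES ('john', 0.05)")
with_lookup = add_tf_column(con, "new_records", "first_name")
```

In both cases every submitted record comes back, which is what lets prediction proceed; only the tf value differs.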
A previous attempt at this was made here:
#1604