Fix bugs in calculations for true negatives when using accuracy _from_column functions #2150
Conversation
sql = f"""
select
    *,
    {truth_thres_expr} as truth_threshold,
    case when clerical_match_score >= {threshold_actual} then 1
    else 0
    end
-       as c_P,
+       as clerical_positive,
I've renamed some of the variables to make it as clear as possible what they mean, so that when you turn on debug_mode = True, tracing through the calculations is as simple as possible.
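As a rough sketch of why the naming matters (assuming linker is an already-configured splink Linker; the labels column name "cluster" is an illustrative assumption, not taken from the PR), debug mode prints the intermediate SQL and tables as they are created, so descriptively named columns such as clerical_positive are much easier to trace:

# Assumes `linker` is an already-configured splink Linker; "cluster" is an
# illustrative labels column name, not one taken from this PR.
linker.debug_mode = True  # print intermediate SQL and tables as they are built
truth_space = linker.truth_space_table_from_labels_column("cluster")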
cumulative_clerical_positives_at_or_above_threshold,

cumulative_clerical_negatives_below_threshold
    + {total_additional_clerical_negatives}
This part is where the adjustments for 'implicit negatives' are made.
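A minimal sketch of the arithmetic (the numbers below are made up for illustration, not values from the PR): the labels table only contains explicit negatives, and the implicit negatives are all scored 0, so they sit below any threshold and are added on as a constant.

# Illustrative values only.
cumulative_clerical_negatives_below_threshold = 40   # explicit negative label rows below the threshold
total_additional_clerical_negatives = 960            # implicit negatives: differing-ID pairs never
                                                     # written out as label rows

# True negatives at a threshold must count both groups, which is what the
# SQL above achieves by adding the constant to the cumulative sum.
true_negatives = (
    cumulative_clerical_negatives_below_threshold
    + total_additional_clerical_negatives
)
print(true_negatives)  # 1000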
# Override the truth threshold (splink score) for any records
# not found by blocking rules

if positives_not_captured_by_blocking_rules_scored_as_zero:
Ensures a score of 0 match probability is used wherever found_by_blocking_rules is false.
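A sketch of the override in pandas terms (the dataframe and column names are assumptions for illustration; the PR itself applies this in SQL):

import pandas as pd

# Assumed columns: a model score and a flag saying whether any blocking rule
# actually generated the pair.
df = pd.DataFrame({
    "match_probability": [0.92, 0.40, 0.85],
    "found_by_blocking_rules": [True, True, False],
})

positives_not_captured_by_blocking_rules_scored_as_zero = True
if positives_not_captured_by_blocking_rules_scored_as_zero:
    # Pairs the blocking rules never produced get no real score in practice,
    # so treat them as definite non-matches.
    df.loc[~df["found_by_blocking_rules"], "match_probability"] = 0.0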
# First we need to calculate the number of implicit true negatives
# That is, any pair of records which have a different ID in the labels
# column are a negative
link_type = linker._settings_obj._link_type
These new calculations derive the total number of labels, enabling us to compute how many 'implicit negatives' (ghost labels) there are.
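As a rough sketch of the counting logic for the dedupe case (the helper below is hypothetical, not code from the PR; the total pair count depends on the link type): every distinct pair of labelled records is a comparison, pairs sharing an ID are positives, and all remaining pairs are implicit negatives.

from collections import Counter

def count_implicit_negatives(label_ids):
    # Hypothetical helper for illustration (dedupe link type).
    n = len(label_ids)
    total_pairs = n * (n - 1) // 2                 # all distinct record pairs
    positive_pairs = sum(
        c * (c - 1) // 2 for c in Counter(label_ids).values()
    )                                              # pairs sharing the same label ID
    return total_pairs - positive_pairs            # differing-ID pairs are negatives

# Five records in two clusters: 10 pairs in total, 1 + 3 = 4 positives, 6 negatives.
print(count_implicit_negatives(["a", "a", "b", "b", "b"]))  # 6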
@@ -2153,6 +2153,7 @@ def truth_space_table_from_labels_column(
    labels_column_name,
    threshold_actual=0.5,
    match_weight_round_to_nearest: float = None,
+   positives_not_captured_by_blocking_rules_scored_as_zero: bool = True,
Allow the user to control this.
In some cases the user may be interested in how good their scoring model is on its own, or in analysing the difference blocking makes (what proportion of the false positives and negatives are due to bad blocking, and what proportion are due to bad scoring).
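A sketch of how the flag might be used (assuming linker is an already-configured splink Linker; the labels column name "cluster" is an assumption for illustration):

# Default: pairs the blocking rules missed are scored as 0, so blocking errors
# surface as false negatives in the truth space table.
truth_space = linker.truth_space_table_from_labels_column("cluster")

# Keep the model's own scores for those pairs instead, to judge the scoring
# model in isolation from the blocking rules.
truth_space_scoring_only = linker.truth_space_table_from_labels_column(
    "cluster",
    positives_not_captured_by_blocking_rules_scored_as_zero=False,
)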
Looks good, testing strategy very neat 👍
Closes #2059. See more detailed description there.
Summary of bugs:
Implementation:
Note I have developed significantly more robust tests than the previous ones.
The new testing strategy is:
1. Compute predictions using a 1=1 (full cartesian) blocking rule. Override scores with 0 match probability in the case the match_key corresponds to the 1=1 case. Derive truth statistics (TP, TN, FP, FN) from this table using pandas (a sketch of this derivation is shown below the list).
2. Compare these against the results of the accuracy functions, which do not use the 1=1 rule. This means we can be confident the adjustments to account for 'implicit negative labels' are correct.
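As a rough sketch of the pandas derivation in step 1 (column names, scores and the threshold below are illustrative assumptions, not values from the test code):

import pandas as pd

# Assumed columns: a score from the full-cartesian prediction, and a flag
# derived from the labels column saying whether the pair is a clerical match.
df = pd.DataFrame({
    "match_probability": [0.95, 0.10, 0.70, 0.02],
    "clerical_positive": [1, 0, 0, 1],
})

threshold = 0.5
predicted_positive = df["match_probability"] >= threshold
actual_positive = df["clerical_positive"] == 1

tp = (predicted_positive & actual_positive).sum()
tn = (~predicted_positive & ~actual_positive).sum()
fp = (predicted_positive & ~actual_positive).sum()
fn = (~predicted_positive & actual_positive).sum()

print(tp, tn, fp, fn)  # 1 1 1 1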