
Fix bugs in calculations for true negatives when using accuracy _from_column functions #2150

Merged
merged 14 commits into from
May 8, 2024

Conversation

@RobinL (Member) commented Apr 24, 2024

Closes #2059. See more detailed description there.

Summary of bugs:

  • Previously, if a pairwise comparison was labelled, it was scored according to the Splink model irrespective of whether it was picked up by the blocking rules. As a result, accuracy analysis measured the scoring model's ability to find matches, rather than the performance of the overall model including blocking.
  • Previously, when labelling from a column, no account was taken of 'implicit negative labels', i.e. a label column gives you a comprehensive list of true positives, and all other pairs are implicitly negative. Where such negatives were not picked up by the blocking rules, they were completely ignored.
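The 'implicit negative labels' idea can be sketched with pandas. This is an illustrative example, not splink's implementation; the `unique_id` and `cluster` column names are assumptions:

```python
import pandas as pd

# Hypothetical labelled data: `cluster` is the label column, so any pair of
# records sharing a cluster is a true match, and every other pair is an
# implicit negative
df = pd.DataFrame({"unique_id": [1, 2, 3, 4], "cluster": ["a", "a", "b", "c"]})

# All distinct pairs (the full cartesian comparison space for deduplication)
pairs = df.merge(df, how="cross", suffixes=("_l", "_r"))
pairs = pairs[pairs["unique_id_l"] < pairs["unique_id_r"]]

pairs["clerical_positive"] = pairs["cluster_l"] == pairs["cluster_r"]

n_positives = pairs["clerical_positive"].sum()     # explicit positives
n_negatives = (~pairs["clerical_positive"]).sum()  # implicit negatives
```

With four records there are six distinct pairs; only one shares a label, so the other five are implicit negatives even if blocking never generates them.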

Implementation:

Note I have developed significantly more robust tests than the previous ones.

The new testing strategy is:

  • Use a small input dataframe
  • Get Splink to compute comparisons using the model's blocking rules, and finally a 1=1 (full cartesian) blocking rule. Override scores with 0 match probability where the match_key corresponds to the 1=1 case.
  • Derive truth statistics (TP, TN, FP, FN) from this table using pandas
  • Compare results with the Splink functions (where Splink is using the blocking rules, as opposed to a 1=1 rule). This means we can be confident the adjustments to account for 'implicit negative labels' are correct.
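The pandas side of this test strategy can be sketched as follows. The column names (`found_by_blocking_rules`, `clerical_positive`) and the data are illustrative assumptions, not the PR's actual test code:

```python
import pandas as pd

# Hypothetical scored comparison table: `found_by_blocking_rules` is False for
# pairs only produced by the fallback 1=1 cartesian rule
pairs = pd.DataFrame(
    {
        "match_probability": [0.95, 0.80, 0.10, 0.70],
        "found_by_blocking_rules": [True, True, True, False],
        "clerical_positive": [True, False, False, True],
    }
)

# Score pairs not found by the blocking rules as 0 match probability
pairs.loc[~pairs["found_by_blocking_rules"], "match_probability"] = 0.0

threshold = 0.5
predicted_positive = pairs["match_probability"] >= threshold

# Derive truth statistics independently of Splink's own functions
tp = (predicted_positive & pairs["clerical_positive"]).sum()
fp = (predicted_positive & ~pairs["clerical_positive"]).sum()
fn = (~predicted_positive & pairs["clerical_positive"]).sum()
tn = (~predicted_positive & ~pairs["clerical_positive"]).sum()
```

Because the cartesian comparison space contains every pair, no negative can be silently dropped, which is what makes these statistics a trustworthy reference to compare Splink's output against.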

@RobinL RobinL changed the title (WIP) Fix bugs in calculations for true negatives when using accuracy _from_column functions Fix bugs in calculations for true negatives when using accuracy _from_column functions Apr 25, 2024
sql = f"""
select
*,
{truth_thres_expr} as truth_threshold,
case when clerical_match_score >= {threshold_actual} then 1
else 0
end
as clerical_positive,
@RobinL (Member, Author) commented:

I've renamed some of the variables to make their meaning as clear as possible, so that when you turn on debug_mode = True, tracing through the calculations is as simple as possible

cumulative_clerical_positives_at_or_above_threshold,

cumulative_clerical_negatives_below_threshold
+ {total_additional_clerical_negatives}
@RobinL (Member, Author) commented Apr 25, 2024:

This part is where the adjustments for 'implicit negatives' are made
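The adjustment above can be illustrated with a small worked example. The counts are invented for illustration; only the arithmetic mirrors the SQL fragment:

```python
# Hypothetical counts at one threshold of the truth space table
cumulative_clerical_negatives_below_threshold = 40  # negatives Splink scored
total_additional_clerical_negatives = 60            # implicit negatives never
                                                    # generated by blocking

# Implicit negatives were never scored, so they sit below every positive
# threshold and all count as true negatives
true_negatives = (
    cumulative_clerical_negatives_below_threshold
    + total_additional_clerical_negatives
)
```

Without the added term, the 60 implicit negatives would vanish from the truth statistics entirely, inflating metrics such as specificity.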

# Override the truth threshold (splink score) for any records
# not found by blocking rules

if positives_not_captured_by_blocking_rules_scored_as_zero:
@RobinL (Member, Author) commented:

Ensure a score of 0 match probability wherever found_by_blocking_rules is false
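A minimal pandas sketch of this override, assuming hypothetical `match_probability` and `found_by_blocking_rules` columns on a scored-labels table:

```python
import pandas as pd

# Hypothetical scored label pairs
scored = pd.DataFrame(
    {
        "match_probability": [0.9, 0.6],
        "found_by_blocking_rules": [True, False],
    }
)

positives_not_captured_by_blocking_rules_scored_as_zero = True

if positives_not_captured_by_blocking_rules_scored_as_zero:
    # A pair the blocking rules missed can never be predicted as a match by
    # the overall model, so force its score to 0 before deriving truth stats
    scored.loc[~scored["found_by_blocking_rules"], "match_probability"] = 0.0
```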

# First we need to calculate the number of implicit true negatives
# That is, any pair of records which have a different ID in the labels
# column are a negative
link_type = linker._settings_obj._link_type
@RobinL (Member, Author) commented:

These new calculations derive the total number of labels, enabling us to compute how many 'implicit negative' (ghost) labels there are
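For a dedupe-only link type, the counting logic can be sketched like this. All counts are made-up for illustration and the formula is a simplification of what the PR derives via SQL:

```python
# For deduplication, the total comparison space is n * (n - 1) / 2
n_records = 1000
total_comparisons = n_records * (n_records - 1) // 2

# Hypothetical counts from the labels and the scored comparisons
n_clerical_positives = 150  # pairs sharing a label id
n_scored_negatives = 2000   # negatives generated by the blocking rules

# Implicit negatives: everything that is neither a clerical positive nor
# already present in the scored comparisons
total_additional_clerical_negatives = (
    total_comparisons - n_clerical_positives - n_scored_negatives
)
```

The link type matters because the size of the comparison space differs: `link_only` jobs compare records across input frames rather than within one, so the formula for `total_comparisons` changes accordingly.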

@@ -2153,6 +2153,7 @@ def truth_space_table_from_labels_column(
labels_column_name,
threshold_actual=0.5,
match_weight_round_to_nearest: float = None,
positives_not_captured_by_blocking_rules_scored_as_zero: bool = True,
@RobinL (Member, Author) commented:

Allow the user to control this.

In some cases the user may be interested in how good their scoring model is, or in analysing the difference blocking is making (what proportion of the false positives and negatives are due to bad blocking, and what proportion are due to bad scoring).
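The blocking-vs-scoring attribution can be sketched directly on a scored comparison table. The column names and data below are illustrative assumptions, not splink's API:

```python
import pandas as pd

# Hypothetical scored label pairs: all three are clerical positives
pairs = pd.DataFrame(
    {
        "match_probability": [0.9, 0.2, 0.8],
        "found_by_blocking_rules": [True, True, False],
        "clerical_positive": [True, True, True],
    }
)

threshold = 0.5

# False negative due to scoring: blocking found the pair, but the model
# scored it below the threshold
fn_scoring = (
    pairs["found_by_blocking_rules"]
    & (pairs["match_probability"] < threshold)
    & pairs["clerical_positive"]
).sum()

# False negative due to blocking: the pair was never generated at all,
# so no score could have recovered it
fn_blocking = (
    ~pairs["found_by_blocking_rules"] & pairs["clerical_positive"]
).sum()
```

Running the accuracy functions with the flag on and off and differencing the results gives the same decomposition at the level of the whole truth space table.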

@RobinL RobinL requested a review from ADBond April 30, 2024 12:38
@ADBond (Contributor) left a comment:

Looks good, testing strategy very neat 👍

@RobinL RobinL merged commit 2a4d6ab into splink4_dev May 8, 2024
15 checks passed
@RobinL RobinL deleted the fix_2059 branch May 8, 2024 15:01