
Fix bugs in calculations for true negatives when using accuracy _from_column functions #2150

Merged
merged 14 commits into from
May 8, 2024

Conversation

@RobinL (Member) commented Apr 24, 2024

Closes #2059. See more detailed description there.

Summary of bugs:

  • Previously, if a pairwise comparison was labelled, it was scored according to the Splink model irrespective of whether it was picked up by the blocking rules. As a result, accuracy analysis measured the scoring model's ability to find matches, rather than the performance of the overall model including blocking.
  • Previously, when labelling from a column, no account was taken of 'implicit negative labels', i.e. a label column gives you a comprehensive list of true positives, and all other pairs are implicitly negative. Where such negatives were not picked up by the blocking rules, they were completely ignored.
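The 'implicit negative labels' idea can be sketched with pandas. This is an illustrative example, not splink's implementation; the `unique_id` and `cluster` column names are assumptions:

```python
import pandas as pd

# Hypothetical labelled data: `cluster` is the label column, so any pair of
# records sharing a cluster is a true match, and every other pair is an
# implicit negative
df = pd.DataFrame({"unique_id": [1, 2, 3, 4], "cluster": ["a", "a", "b", "c"]})

# All distinct pairs (the full cartesian comparison space for deduplication)
pairs = df.merge(df, how="cross", suffixes=("_l", "_r"))
pairs = pairs[pairs["unique_id_l"] < pairs["unique_id_r"]]

pairs["clerical_positive"] = pairs["cluster_l"] == pairs["cluster_r"]

n_positives = pairs["clerical_positive"].sum()     # explicit positives
n_negatives = (~pairs["clerical_positive"]).sum()  # implicit negatives
```

With four records there are six distinct pairs; only one shares a label, so the other five are implicit negatives even if blocking never generates them.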

Implementation:

Note I have developed significantly more robust tests than the previous ones.

The new testing strategy is:

  • Use a small input dataframe
  • Get Splink to compute comparisons using the model's blocking rules, and finally a 1=1 (full cartesian) blocking rule. Override scores with 0 match probability where the match_key corresponds to the 1=1 case.
  • Derive truth statistics (TP, TN, FP, FN) from this table using pandas
  • Compare results with the Splink functions (where Splink is using the blocking rules, as opposed to a 1=1 rule). This means we can be confident the adjustments to account for 'implicit negative labels' are correct.
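The pandas side of this test strategy can be sketched as follows. The column names (`found_by_blocking_rules`, `clerical_positive`) and the data are illustrative assumptions, not the PR's actual test code:

```python
import pandas as pd

# Hypothetical scored comparison table: `found_by_blocking_rules` is False for
# pairs only produced by the fallback 1=1 cartesian rule
pairs = pd.DataFrame(
    {
        "match_probability": [0.95, 0.80, 0.10, 0.70],
        "found_by_blocking_rules": [True, True, True, False],
        "clerical_positive": [True, False, False, True],
    }
)

# Score pairs not found by the blocking rules as 0 match probability
pairs.loc[~pairs["found_by_blocking_rules"], "match_probability"] = 0.0

threshold = 0.5
predicted_positive = pairs["match_probability"] >= threshold

# Derive truth statistics independently of Splink's own functions
tp = (predicted_positive & pairs["clerical_positive"]).sum()
fp = (predicted_positive & ~pairs["clerical_positive"]).sum()
fn = (~predicted_positive & pairs["clerical_positive"]).sum()
tn = (~predicted_positive & ~pairs["clerical_positive"]).sum()
```

Because the cartesian comparison space contains every pair, no negative can be silently dropped, which is what makes these statistics a trustworthy reference to compare Splink's output against.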

@RobinL RobinL changed the title (WIP) Fix bugs in calculations for true negatives when using accuracy _from_column functions Fix bugs in calculations for true negatives when using accuracy _from_column functions Apr 25, 2024
sql = f"""
select
*,
{truth_thres_expr} as truth_threshold,
case when clerical_match_score >= {threshold_actual} then 1
else 0
end
as clerical_positive,
@RobinL (Member, Author) commented:

I've renamed some of the variables to make their meaning as clear as possible, so that when you turn on debug_mode = True, tracing through the calculations is as simple as possible

cumulative_clerical_positives_at_or_above_threshold,

cumulative_clerical_negatives_below_threshold
+ {total_additional_clerical_negatives}
@RobinL (Member, Author) commented Apr 25, 2024:

This part is where the adjustments for 'implicit negatives' are made
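The adjustment above can be illustrated with a small worked example. The counts are invented for illustration; only the arithmetic mirrors the SQL fragment:

```python
# Hypothetical counts at one threshold of the truth space table
cumulative_clerical_negatives_below_threshold = 40  # negatives Splink scored
total_additional_clerical_negatives = 60            # implicit negatives never
                                                    # generated by blocking

# Implicit negatives were never scored, so they sit below every positive
# threshold and all count as true negatives
true_negatives = (
    cumulative_clerical_negatives_below_threshold
    + total_additional_clerical_negatives
)
```

Without the added term, the 60 implicit negatives would vanish from the truth statistics entirely, inflating metrics such as specificity.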

# Override the truth threshold (splink score) for any records
# not found by blocking rules

if positives_not_captured_by_blocking_rules_scored_as_zero:
@RobinL (Member, Author) commented:

Ensure a score of 0 match probability wherever found_by_blocking_rules is false
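A minimal pandas sketch of this override, assuming hypothetical `match_probability` and `found_by_blocking_rules` columns on a scored-labels table:

```python
import pandas as pd

# Hypothetical scored label pairs
scored = pd.DataFrame(
    {
        "match_probability": [0.9, 0.6],
        "found_by_blocking_rules": [True, False],
    }
)

positives_not_captured_by_blocking_rules_scored_as_zero = True

if positives_not_captured_by_blocking_rules_scored_as_zero:
    # A pair the blocking rules missed can never be predicted as a match by
    # the overall model, so force its score to 0 before deriving truth stats
    scored.loc[~scored["found_by_blocking_rules"], "match_probability"] = 0.0
```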

# First we need to calculate the number of implicit true negatives
# That is, any pair of records which have a different ID in the labels
# column are a negative
link_type = linker._settings_obj._link_type
@RobinL (Member, Author) commented:

These new calculations derive the total number of labels, enabling us to compute how many 'implicit negative' (ghost) labels there are
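For a dedupe-only link type, the counting logic can be sketched like this. All counts are made-up for illustration and the formula is a simplification of what the PR derives via SQL:

```python
# For deduplication, the total comparison space is n * (n - 1) / 2
n_records = 1000
total_comparisons = n_records * (n_records - 1) // 2

# Hypothetical counts from the labels and the scored comparisons
n_clerical_positives = 150  # pairs sharing a label id
n_scored_negatives = 2000   # negatives generated by the blocking rules

# Implicit negatives: everything that is neither a clerical positive nor
# already present in the scored comparisons
total_additional_clerical_negatives = (
    total_comparisons - n_clerical_positives - n_scored_negatives
)
```

The link type matters because the size of the comparison space differs: `link_only` jobs compare records across input frames rather than within one, so the formula for `total_comparisons` changes accordingly.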

@@ -2153,6 +2153,7 @@ def truth_space_table_from_labels_column(
labels_column_name,
threshold_actual=0.5,
match_weight_round_to_nearest: float = None,
positives_not_captured_by_blocking_rules_scored_as_zero: bool = True,
@RobinL (Member, Author) commented:

Allow the user to control this.

In some cases the user may be interested in how good their scoring model is, or in analysing the difference blocking is making (what proportion of the false positives and negatives are due to bad blocking, and what proportion are due to bad scoring).
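The blocking-vs-scoring attribution can be sketched directly on a scored comparison table. The column names and data below are illustrative assumptions, not splink's API:

```python
import pandas as pd

# Hypothetical scored label pairs: all three are clerical positives
pairs = pd.DataFrame(
    {
        "match_probability": [0.9, 0.2, 0.8],
        "found_by_blocking_rules": [True, True, False],
        "clerical_positive": [True, True, True],
    }
)

threshold = 0.5

# False negative due to scoring: blocking found the pair, but the model
# scored it below the threshold
fn_scoring = (
    pairs["found_by_blocking_rules"]
    & (pairs["match_probability"] < threshold)
    & pairs["clerical_positive"]
).sum()

# False negative due to blocking: the pair was never generated at all,
# so no score could have recovered it
fn_blocking = (
    ~pairs["found_by_blocking_rules"] & pairs["clerical_positive"]
).sum()
```

Running the accuracy functions with the flag on and off and differencing the results gives the same decomposition at the level of the whole truth space table.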

@RobinL RobinL requested a review from ADBond April 30, 2024 12:38
@ADBond (Contributor) left a comment:

Looks good, testing strategy very neat 👍

@RobinL RobinL merged commit 2a4d6ab into splink4_dev May 8, 2024
15 checks passed
@RobinL RobinL deleted the fix_2059 branch May 8, 2024 15:01