Simple extension to term frequency adjustments for inexact matches #2020

samkodes · 2024-03-02T23:02:50Z

Type of PR

FEAT

Is your Pull Request linked to an existing Issue or Pull Request?

Addresses the second approach to #2006 (via settings dictionary, disabling TF inflation for a comparison level or allowing a comparison level to be the reference level for TF inflation)

Give a brief description for the solution you have provided

Via the settings dictionary, a comparison level can now have two extra boolean flags:

is_tf_inflation_reference_level
disable_tf_inflation

If disable_tf_inflation=True, the TF adjustment code uses the level's u-value rather than seeking the exact-match level's u-value.
If is_tf_inflation_reference_level=True, the TF adjustment code will use this level's u-value for other levels instead of the exact-match level's u-value.

PR Checklist

Added documentation for changes
Added feature to example notebooks or tutorial (if appropriate)
Added tests (if appropriate)
Updated CHANGELOG.md (if appropriate)
Made changes based off the latest version of Splink
Run the linter

…e_level and disable_tf_inflation and implemented corresponding changes to TF inflation factor calculation

RobinL · 2024-03-05T18:53:00Z

Hi Sam - thanks for this and the chat earlier.

Mulling this over I think i'm pretty comfortable with adding something like this into Splink and the settings object. Being able to override the default 'automagic' behaviour of searching the exact match level seems sensible and consistent with other parts of Splink.

I'd be grateful if you take a look at this example script - I want to check I've understood the use case. This code runs fine on this branch:

Runnable script - click to expand

import duckdb

import splink.duckdb.comparison_level_library as cll
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

# duckdb sql statement that creates a few rows pertaining to a few people
# using UNION ALL


sql = """
    SELECT '1' as unique_id, 'John' as first_name, ['a','b','c'] as test_array
    UNION ALL
    SELECT '2', 'Jane', ['a','e','f']
    UNION ALL
    SELECT '3', 'Jack', ['a','h','i']
    UNION ALL
    SELECT '4', 'Jill', ['a','k','l']
    UNION ALL
    SELECT '5', 'Joe', ['a','n','o']
    """

df_master = duckdb.sql(sql)
print(df_master)


sql = """
SELECT '5' as unique_id, 'John' as first_name, ['a'] as test_array
UNION ALL
SELECT '6', 'John', ['e']
UNION ALL
SELECT '7', 'Jane', ['e']
"""

df_long = duckdb.sql(sql)
print(df_long)


# Derive custom tf table
sql = """
with exploded as (
select unnest(test_array) as test_array
from df_master
)
select [test_array] as test_array, count(*)/(select count(*) from exploded) as tf_test_array
from exploded
group by test_array
"""
__splink__df_tf_test_array = duckdb.sql(sql).df()
__splink__df_tf_test_array


comparison_test_array = {
    "output_column_name": "test_array",
    "comparison_levels": [
        cll.null_level("test_array"),
        {
            "sql_condition": 'array_length(list_intersect("test_array_l", "test_array_r"))>= 1',
            "label_for_charts": "Arrays intersect",
            "tf_adjustment_column": "test_array",  # This controls which lookup to use, it's unrelated to the exact match level
            "tf_adjustment_weight": 1.0,
            "disable_tf_exact_match_detection": True,  # Don't look for an exact match level, use own level instead
        },
        cll.else_level(),
    ],
    "comparison_description": "arr",
}


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": ["1=1"],  # full cartesian product
    "comparisons": [cl.exact_match("first_name"), comparison_test_array],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(["df_long", "df_master"], settings)
linker.register_term_frequency_lookup(__splink__df_tf_test_array, "test_array")


import logging

logging.getLogger("splink").setLevel(1)
linker.predict().as_pandas_dataframe()

There's one part of our chat that I'm now not sure I'm completely clear on:

For the array use case, is the above example sufficient to get the behaviour you want? i.e. you only need is_tf_inflation_reference_level and you don't need disable_tf_inflation?
I think we agreed that for a fuzzy match level like levenshtein=1, there's not much theoretical basis for applying the inflation factor (the difference between u_levenshtein and u_exact), and arguably you should just use T_x unadjusted (take the u value from the term frequency table lookup).
In which case, would it be sufficient to have a single new key in the settings dictionary similar to is_tf_inflation_reference_level except it isn't 'searched for' by other comparison levels? That is, a single flag isolated to the ComparisonLevel that means: 'use the u value from this level as the reference level for the term frequency adjustments I've asked for on this level. It could be called, for instance, use_u_value_as_tf_reference_level`?

I think (?) that the single use_u_value_as_tf_reference_level could cover all use cases we envisage. The only case it doesn't cover is if you explicitly want a different level to use that level as the u reference level for the purpose of tf_inflation, but I think we agree that might be a bad idea anyway.

Sorry if I've misunderstood again!!!

samkodes · 2024-03-05T21:48:09Z

Hi Robin, Thanks for this. I believe the code implements the current use case I am considering. Regarding "is_tf_inflation_reference_level" - yes, this works with my patch. In the case of a single comparison level, there is no difference in result between using this and the "disable_tf_inflation" flag. If there is more than one comparison level using the same tf_adjustment_column using this option will affect the other level though. Regarding your proposed "use_u_value_as_tf_reference_level", this essentially the same behaviour and implementation as my "disable_tf_inflation" flag in the patch. I am no longer sure there is a reason for the "is_tf_inflation_reference_level" behaviour - i.e. other levels searching for this level - at all. Because of this I would prefer the "disable_tf_inflation" name because I don't think we need a "reference level" for anything! But either name is ok. Sam

…

On Tue, Mar 5, 2024 at 1:53 PM Robin Linacre ***@***.***> wrote: Hi Sam - thanks for this and the chat earlier. Mulling this over I think i'm pretty comfortable with adding something like this into Splink and the settings object. Being able to override the default 'automagic' behaviour of searching the exact match level seems sensible and consistent with other parts of Splink. I'd be grateful if you take a look at this example script - I want to check I've understood the use case. This code runs fine on this branch: Runnable script - click to expand import duckdb import splink.duckdb.comparison_level_library as cllimport splink.duckdb.comparison_library as clfrom splink.duckdb.linker import DuckDBLinker # duckdb sql statement that creates a few rows pertaining to a few people# using UNION ALL sql = """ SELECT '1' as unique_id, 'John' as first_name, ['a','b','c'] as test_array UNION ALL SELECT '2', 'Jane', ['a','e','f'] UNION ALL SELECT '3', 'Jack', ['a','h','i'] UNION ALL SELECT '4', 'Jill', ['a','k','l'] UNION ALL SELECT '5', 'Joe', ['a','n','o'] """ df_master = duckdb.sql(sql)print(df_master) sql = """SELECT '5' as unique_id, 'John' as first_name, ['a'] as test_arrayUNION ALLSELECT '6', 'John', ['e']UNION ALLSELECT '7', 'Jane', ['e']""" df_long = duckdb.sql(sql)print(df_long) # Derive custom tf tablesql = """with exploded as (select unnest(test_array) as test_arrayfrom df_master)select [test_array] as test_array, count(*)/(select count(*) from exploded) as tf_test_arrayfrom explodedgroup by test_array"""__splink__df_tf_test_array = duckdb.sql(sql).df()__splink__df_tf_test_array comparison_test_array = { "output_column_name": "test_array", "comparison_levels": [ cll.null_level("test_array"), { "sql_condition": 'array_length(list_intersect("test_array_l", "test_array_r"))>= 1', "label_for_charts": "Arrays intersect", "tf_adjustment_column": "test_array", # This controls which lookup to use, it's unrelated to the exact match level "tf_adjustment_weight": 1.0, "is_tf_inflation_reference_level": True, # Don't look for an exact match level, use own level instead }, cll.else_level(), ], "comparison_description": "arr", } settings = { "probability_two_random_records_match": 0.01, "link_type": "link_only", "blocking_rules_to_generate_predictions": ["1=1"], # full cartesian product "comparisons": [cl.exact_match("first_name"), comparison_test_array], "retain_intermediate_calculation_columns": True, } linker = DuckDBLinker(["df_long", "df_master"], settings)linker.register_term_frequency_lookup(__splink__df_tf_test_array, "test_array") import logging logging.getLogger("splink").setLevel(1)linker.predict().as_pandas_dataframe() There's one part of our chat that I'm now not sure I'm completely clear on: - For the array use case, is the above example sufficient to get the behaviour you want? i.e. you only need is_tf_inflation_reference_level and you don't need disable_tf_inflation? - I think we agreed that for a fuzzy match level like levenshtein=1, there's not much theoretical basis for applying the inflation factor (the difference between u_levenshtein and u_exact), and arguably you should just use T_x unadjusted (take the u value from the term frequency table lookup). - In which case, would it be sufficient to have a single new key in the settings dictionary similar to is_tf_inflation_reference_level except it isn't 'searched for' by other comparison levels? That is, a single flag isolated to the ComparisonLevel that means: 'use the u value from this level as the reference level for the term frequency adjustments I've asked for on this level. It could be called, for instance, use_u_value_as_tf_reference_level`? Sorry if I've misunderstood again!!! — Reply to this email directly, view it on GitHub <#2020 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AF7DD4W7F457OEGUXRXPA2DYWYIBRAVCNFSM6AAAAABEDN6GVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZZGQZTIMBTGM> . You are receiving this because you authored the thread.Message ID: ***@***.***>

RobinL · 2024-03-06T08:22:29Z

Great - yes, so I think we agree on the behaviour i.e. that we want the behaviour of disable_tf_inflation and that a single flag is sufficient.

It's just choosing the clearest name.

I agree with your argument that it's best to avoid the term reference or reference_level. I'm less keen on the term inflation because, whilst I think it's conceptually useful, i'm not sure it's that meaningful to external users (it needs a bit of explaining to understand it).

Perhaps disable_tf_exact_match_detection?

samkodes · 2024-03-06T11:50:14Z

Sounds fine to me!

…

On Wed, Mar 6, 2024, 3:22 a.m. Robin Linacre ***@***.***> wrote: Great - yes, so I think we agree on the behaviour i.e. that we want the behaviour of disable_tf_inflation and that a single flag is sufficient. It's just choosing the clearest name. I agree with your argument that it's best to avoid the term reference or reference_level. I'm less keen on the term inflation because, whilst I think it's conceptually useful, i'm not sure it's that meaningful to external users (it needs a bit of explaining to understand it). Perhaps disable_tf_exact_match_detection? — Reply to this email directly, view it on GitHub <#2020 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AF7DD4UGGS4JW2ANX3W5UKTYW3G5FAVCNFSM6AAAAABEDN6GVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBQGMYTOMZSG4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

RobinL · 2024-03-06T13:16:04Z

Great. If you're happy for me to do so, I'll go ahead and make the change on this branch, and add a test (based on the example script above), then we can merge. Will prob be a few days before I get around to it - feel free to go ahead and do it yourself i you'd prefer

…and removed 'is_tf_reference_level' behaviour.

samkodes · 2024-03-06T14:38:21Z

I've removed the "is_tf_inflation_reference_level" stuff and renamed "disable_tf_inflation" as above to "disable_tf_exact_match_detection" in the branch.

RobinL · 2024-03-06T17:22:16Z

Alright - this is good to go from my point of view. Andy will review, and once he's happy we'll merge. Thanks very much!

ADBond

Sorry for the delay in getting to this - thanks for this, looks great.

All good from my end - I'll sort out the conflict and merge this up

Sam Kaufman and others added 3 commits March 1, 2024 15:25

Added boolean flags to comparison levels for is_tf_inflation_referenc…

37e89f0

…e_level and disable_tf_inflation and implemented corresponding changes to TF inflation factor calculation

Fixed missing commas in settings JSON schema

5ebc6ce

Merge branch 'moj-analytical-services:master' into master

b9e1ca4

samkodes mentioned this pull request Mar 2, 2024

[FEAT] Exact match level required for TF adjustment - can this be avoided? #2006

Open

samnlindsay mentioned this pull request Mar 4, 2024

Threshold selection tool #2003

Merged

10 tasks

Sam Kaufman added 2 commits March 6, 2024 09:34

Renamed 'disable_tf_inflation' to 'disable_tf_exact_match_detection' …

de66761

…and removed 'is_tf_reference_level' behaviour.

Merge branch 'master' of https://github.com/samkodes/splink_tf_inexact

6c8148f

RobinL added 5 commits March 6, 2024 16:20

lint to minimise diff

4472180

line length

80efc6a

add tests

0d21c98

Merge branch 'master' into master

f4e0307

update changelog

d15cd96

RobinL requested a review from ADBond March 6, 2024 17:16

RobinL requested review from ThomasHepworth and removed request for ThomasHepworth March 18, 2024 10:45

ADBond approved these changes Mar 18, 2024

View reviewed changes

Merge branch 'master' into master

73589dd

ADBond merged commit ab522e6 into moj-analytical-services:master Mar 18, 2024
10 checks passed

RobinL mentioned this pull request Apr 2, 2024

Merge 3 4 2123 #2124

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simple extension to term frequency adjustments for inexact matches #2020

Simple extension to term frequency adjustments for inexact matches #2020

samkodes commented Mar 2, 2024

RobinL commented Mar 5, 2024 •

edited

Loading

samkodes commented Mar 5, 2024 via email

RobinL commented Mar 6, 2024

samkodes commented Mar 6, 2024 via email

RobinL commented Mar 6, 2024

samkodes commented Mar 6, 2024

RobinL commented Mar 6, 2024

ADBond left a comment

Simple extension to term frequency adjustments for inexact matches #2020

Simple extension to term frequency adjustments for inexact matches #2020

Conversation

samkodes commented Mar 2, 2024

Type of PR

Is your Pull Request linked to an existing Issue or Pull Request?

Give a brief description for the solution you have provided

PR Checklist

RobinL commented Mar 5, 2024 • edited Loading

samkodes commented Mar 5, 2024 via email

RobinL commented Mar 6, 2024

samkodes commented Mar 6, 2024 via email

RobinL commented Mar 6, 2024

samkodes commented Mar 6, 2024

RobinL commented Mar 6, 2024

ADBond left a comment

Choose a reason for hiding this comment

RobinL commented Mar 5, 2024 •

edited

Loading