Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple extension to term frequency adjustments for inexact matches #2020

Merged
merged 11 commits into from
Mar 18, 2024

Conversation

samkodes
Copy link
Contributor

@samkodes samkodes commented Mar 2, 2024

Type of PR

  • FEAT

Is your Pull Request linked to an existing Issue or Pull Request?

Addresses the second approach to #2006 (via settings dictionary, disabling TF inflation for a comparison level or allowing a comparison level to be the reference level for TF inflation)

Give a brief description for the solution you have provided

Via the settings dictionary, a comparison level can now have two extra boolean flags:

  • is_tf_inflation_reference_level
  • disable_tf_inflation

If disable_tf_inflation=True, the TF adjustment code uses the level's u-value rather than seeking the exact-match level's u-value.
If is_tf_inflation_reference_level=True, the TF adjustment code will use this level's u-value for other levels instead of the exact-match level's u-value.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

Sam Kaufman and others added 3 commits March 1, 2024 15:25
@RobinL
Copy link
Member

RobinL commented Mar 5, 2024

Hi Sam - thanks for this and the chat earlier.

Mulling this over I think i'm pretty comfortable with adding something like this into Splink and the settings object. Being able to override the default 'automagic' behaviour of searching the exact match level seems sensible and consistent with other parts of Splink.

I'd be grateful if you take a look at this example script - I want to check I've understood the use case. This code runs fine on this branch:

Runnable script - click to expand
import duckdb

import splink.duckdb.comparison_level_library as cll
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

# duckdb sql statement that creates a few rows pertaining to a few people
# using UNION ALL


sql = """
    SELECT '1' as unique_id, 'John' as first_name, ['a','b','c'] as test_array
    UNION ALL
    SELECT '2', 'Jane', ['a','e','f']
    UNION ALL
    SELECT '3', 'Jack', ['a','h','i']
    UNION ALL
    SELECT '4', 'Jill', ['a','k','l']
    UNION ALL
    SELECT '5', 'Joe', ['a','n','o']
    """

df_master = duckdb.sql(sql)
print(df_master)


sql = """
SELECT '5' as unique_id, 'John' as first_name, ['a'] as test_array
UNION ALL
SELECT '6', 'John', ['e']
UNION ALL
SELECT '7', 'Jane', ['e']
"""

df_long = duckdb.sql(sql)
print(df_long)


# Derive custom tf table
sql = """
with exploded as (
select unnest(test_array) as test_array
from df_master
)
select [test_array] as test_array, count(*)/(select count(*) from exploded) as tf_test_array
from exploded
group by test_array
"""
__splink__df_tf_test_array = duckdb.sql(sql).df()
__splink__df_tf_test_array


comparison_test_array = {
    "output_column_name": "test_array",
    "comparison_levels": [
        cll.null_level("test_array"),
        {
            "sql_condition": 'array_length(list_intersect("test_array_l", "test_array_r"))>= 1',
            "label_for_charts": "Arrays intersect",
            "tf_adjustment_column": "test_array",  # This controls which lookup to use, it's unrelated to the exact match level
            "tf_adjustment_weight": 1.0,
            "disable_tf_exact_match_detection": True,  # Don't look for an exact match level, use own level instead
        },
        cll.else_level(),
    ],
    "comparison_description": "arr",
}


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": ["1=1"],  # full cartesian product
    "comparisons": [cl.exact_match("first_name"), comparison_test_array],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(["df_long", "df_master"], settings)
linker.register_term_frequency_lookup(__splink__df_tf_test_array, "test_array")


import logging

logging.getLogger("splink").setLevel(1)
linker.predict().as_pandas_dataframe()

There's one part of our chat that I'm now not sure I'm completely clear on:

  • For the array use case, is the above example sufficient to get the behaviour you want? i.e. you only need is_tf_inflation_reference_level and you don't need disable_tf_inflation?

  • I think we agreed that for a fuzzy match level like levenshtein=1, there's not much theoretical basis for applying the inflation factor (the difference between u_levenshtein and u_exact), and arguably you should just use T_x unadjusted (take the u value from the term frequency table lookup).

  • In which case, would it be sufficient to have a single new key in the settings dictionary similar to is_tf_inflation_reference_level except it isn't 'searched for' by other comparison levels? That is, a single flag isolated to the ComparisonLevel that means: 'use the u value from this level as the reference level for the term frequency adjustments I've asked for on this level. It could be called, for instance, use_u_value_as_tf_reference_level`?

I think (?) that the single use_u_value_as_tf_reference_level could cover all use cases we envisage. The only case it doesn't cover is if you explicitly want a different level to use that level as the u reference level for the purpose of tf_inflation, but I think we agree that might be a bad idea anyway.

Sorry if I've misunderstood again!!!

@samkodes
Copy link
Contributor Author

samkodes commented Mar 5, 2024 via email

@RobinL
Copy link
Member

RobinL commented Mar 6, 2024

Great - yes, so I think we agree on the behaviour i.e. that we want the behaviour of disable_tf_inflation and that a single flag is sufficient.

It's just choosing the clearest name.

I agree with your argument that it's best to avoid the term reference or reference_level. I'm less keen on the term inflation because, whilst I think it's conceptually useful, i'm not sure it's that meaningful to external users (it needs a bit of explaining to understand it).

Perhaps disable_tf_exact_match_detection?

@samkodes
Copy link
Contributor Author

samkodes commented Mar 6, 2024 via email

@RobinL
Copy link
Member

RobinL commented Mar 6, 2024

Great. If you're happy for me to do so, I'll go ahead and make the change on this branch, and add a test (based on the example script above), then we can merge. Will prob be a few days before I get around to it - feel free to go ahead and do it yourself i you'd prefer

@samkodes
Copy link
Contributor Author

samkodes commented Mar 6, 2024

I've removed the "is_tf_inflation_reference_level" stuff and renamed "disable_tf_inflation" as above to "disable_tf_exact_match_detection" in the branch.

@RobinL RobinL requested a review from ADBond March 6, 2024 17:16
@RobinL
Copy link
Member

RobinL commented Mar 6, 2024

Alright - this is good to go from my point of view. Andy will review, and once he's happy we'll merge. Thanks very much!

@RobinL RobinL requested review from ThomasHepworth and removed request for ThomasHepworth March 18, 2024 10:45
Copy link
Contributor

@ADBond ADBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in getting to this - thanks for this, looks great.

All good from my end - I'll sort out the conflict and merge this up

@ADBond ADBond merged commit ab522e6 into moj-analytical-services:master Mar 18, 2024
10 checks passed
@RobinL RobinL mentioned this pull request Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants