-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple extension to term frequency adjustments for inexact matches #2020
Conversation
…e_level and disable_tf_inflation and implemented corresponding changes to TF inflation factor calculation
Hi Sam - thanks for this and the chat earlier. Mulling this over I think i'm pretty comfortable with adding something like this into Splink and the settings object. Being able to override the default 'automagic' behaviour of searching the exact match level seems sensible and consistent with other parts of Splink. I'd be grateful if you take a look at this example script - I want to check I've understood the use case. This code runs fine on this branch: Runnable script - click to expandimport duckdb
import splink.duckdb.comparison_level_library as cll
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker
# duckdb sql statement that creates a few rows pertaining to a few people
# using UNION ALL
sql = """
SELECT '1' as unique_id, 'John' as first_name, ['a','b','c'] as test_array
UNION ALL
SELECT '2', 'Jane', ['a','e','f']
UNION ALL
SELECT '3', 'Jack', ['a','h','i']
UNION ALL
SELECT '4', 'Jill', ['a','k','l']
UNION ALL
SELECT '5', 'Joe', ['a','n','o']
"""
df_master = duckdb.sql(sql)
print(df_master)
sql = """
SELECT '5' as unique_id, 'John' as first_name, ['a'] as test_array
UNION ALL
SELECT '6', 'John', ['e']
UNION ALL
SELECT '7', 'Jane', ['e']
"""
df_long = duckdb.sql(sql)
print(df_long)
# Derive custom tf table
sql = """
with exploded as (
select unnest(test_array) as test_array
from df_master
)
select [test_array] as test_array, count(*)/(select count(*) from exploded) as tf_test_array
from exploded
group by test_array
"""
__splink__df_tf_test_array = duckdb.sql(sql).df()
__splink__df_tf_test_array
comparison_test_array = {
"output_column_name": "test_array",
"comparison_levels": [
cll.null_level("test_array"),
{
"sql_condition": 'array_length(list_intersect("test_array_l", "test_array_r"))>= 1',
"label_for_charts": "Arrays intersect",
"tf_adjustment_column": "test_array", # This controls which lookup to use, it's unrelated to the exact match level
"tf_adjustment_weight": 1.0,
"disable_tf_exact_match_detection": True, # Don't look for an exact match level, use own level instead
},
cll.else_level(),
],
"comparison_description": "arr",
}
settings = {
"probability_two_random_records_match": 0.01,
"link_type": "link_only",
"blocking_rules_to_generate_predictions": ["1=1"], # full cartesian product
"comparisons": [cl.exact_match("first_name"), comparison_test_array],
"retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(["df_long", "df_master"], settings)
linker.register_term_frequency_lookup(__splink__df_tf_test_array, "test_array")
import logging
logging.getLogger("splink").setLevel(1)
linker.predict().as_pandas_dataframe() There's one part of our chat that I'm now not sure I'm completely clear on:
I think (?) that the single Sorry if I've misunderstood again!!! |
Hi Robin,
Thanks for this. I believe the code implements the current use case I am
considering.
Regarding "is_tf_inflation_reference_level" - yes, this works with my
patch. In the case of a single comparison level, there is no difference in
result between using this and the "disable_tf_inflation" flag. If there is
more than one comparison level using the same tf_adjustment_column using
this option will affect the other level though.
Regarding your proposed "use_u_value_as_tf_reference_level", this
essentially the same behaviour and implementation as my
"disable_tf_inflation" flag in the patch.
I am no longer sure there is a reason for the
"is_tf_inflation_reference_level" behaviour - i.e. other levels searching
for this level - at all.
Because of this I would prefer the "disable_tf_inflation" name because I
don't think we need a "reference level" for anything! But either name is
ok.
Sam
…On Tue, Mar 5, 2024 at 1:53 PM Robin Linacre ***@***.***> wrote:
Hi Sam - thanks for this and the chat earlier.
Mulling this over I think i'm pretty comfortable with adding something
like this into Splink and the settings object. Being able to override the
default 'automagic' behaviour of searching the exact match level seems
sensible and consistent with other parts of Splink.
I'd be grateful if you take a look at this example script - I want to
check I've understood the use case. This code runs fine on this branch:
Runnable script - click to expand
import duckdb
import splink.duckdb.comparison_level_library as cllimport splink.duckdb.comparison_library as clfrom splink.duckdb.linker import DuckDBLinker
# duckdb sql statement that creates a few rows pertaining to a few people# using UNION ALL
sql = """ SELECT '1' as unique_id, 'John' as first_name, ['a','b','c'] as test_array UNION ALL SELECT '2', 'Jane', ['a','e','f'] UNION ALL SELECT '3', 'Jack', ['a','h','i'] UNION ALL SELECT '4', 'Jill', ['a','k','l'] UNION ALL SELECT '5', 'Joe', ['a','n','o'] """
df_master = duckdb.sql(sql)print(df_master)
sql = """SELECT '5' as unique_id, 'John' as first_name, ['a'] as test_arrayUNION ALLSELECT '6', 'John', ['e']UNION ALLSELECT '7', 'Jane', ['e']"""
df_long = duckdb.sql(sql)print(df_long)
# Derive custom tf tablesql = """with exploded as (select unnest(test_array) as test_arrayfrom df_master)select [test_array] as test_array, count(*)/(select count(*) from exploded) as tf_test_arrayfrom explodedgroup by test_array"""__splink__df_tf_test_array = duckdb.sql(sql).df()__splink__df_tf_test_array
comparison_test_array = {
"output_column_name": "test_array",
"comparison_levels": [
cll.null_level("test_array"),
{
"sql_condition": 'array_length(list_intersect("test_array_l", "test_array_r"))>= 1',
"label_for_charts": "Arrays intersect",
"tf_adjustment_column": "test_array", # This controls which lookup to use, it's unrelated to the exact match level
"tf_adjustment_weight": 1.0,
"is_tf_inflation_reference_level": True, # Don't look for an exact match level, use own level instead
},
cll.else_level(),
],
"comparison_description": "arr",
}
settings = {
"probability_two_random_records_match": 0.01,
"link_type": "link_only",
"blocking_rules_to_generate_predictions": ["1=1"], # full cartesian product
"comparisons": [cl.exact_match("first_name"), comparison_test_array],
"retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(["df_long", "df_master"], settings)linker.register_term_frequency_lookup(__splink__df_tf_test_array, "test_array")
import logging
logging.getLogger("splink").setLevel(1)linker.predict().as_pandas_dataframe()
There's one part of our chat that I'm now not sure I'm completely clear on:
-
For the array use case, is the above example sufficient to get the
behaviour you want? i.e. you only need is_tf_inflation_reference_level
and you don't need disable_tf_inflation?
-
I think we agreed that for a fuzzy match level like levenshtein=1,
there's not much theoretical basis for applying the inflation factor (the
difference between u_levenshtein and u_exact), and arguably you should just
use T_x unadjusted (take the u value from the term frequency table lookup).
-
In which case, would it be sufficient to have a single new key in the
settings dictionary similar to is_tf_inflation_reference_level except
it isn't 'searched for' by other comparison levels? That is, a single flag
isolated to the ComparisonLevel that means: 'use the u value from this
level as the reference level for the term frequency adjustments I've asked
for on this level. It could be called, for instance,
use_u_value_as_tf_reference_level`?
Sorry if I've misunderstood again!!!
—
Reply to this email directly, view it on GitHub
<#2020 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF7DD4W7F457OEGUXRXPA2DYWYIBRAVCNFSM6AAAAABEDN6GVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZZGQZTIMBTGM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Great - yes, so I think we agree on the behaviour i.e. that we want the behaviour of It's just choosing the clearest name. I agree with your argument that it's best to avoid the term Perhaps |
Sounds fine to me!
…On Wed, Mar 6, 2024, 3:22 a.m. Robin Linacre ***@***.***> wrote:
Great - yes, so I think we agree on the behaviour i.e. that we want the
behaviour of disable_tf_inflation and that a single flag is sufficient.
It's just choosing the clearest name.
I agree with your argument that it's best to avoid the term reference or
reference_level. I'm less keen on the term inflation because, whilst I
think it's conceptually useful, i'm not sure it's that meaningful to
external users (it needs a bit of explaining to understand it).
Perhaps disable_tf_exact_match_detection?
—
Reply to this email directly, view it on GitHub
<#2020 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF7DD4UGGS4JW2ANX3W5UKTYW3G5FAVCNFSM6AAAAABEDN6GVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBQGMYTOMZSG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Great. If you're happy for me to do so, I'll go ahead and make the change on this branch, and add a test (based on the example script above), then we can merge. Will prob be a few days before I get around to it - feel free to go ahead and do it yourself i you'd prefer |
…and removed 'is_tf_reference_level' behaviour.
I've removed the "is_tf_inflation_reference_level" stuff and renamed "disable_tf_inflation" as above to "disable_tf_exact_match_detection" in the branch. |
Alright - this is good to go from my point of view. Andy will review, and once he's happy we'll merge. Thanks very much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in getting to this - thanks for this, looks great.
All good from my end - I'll sort out the conflict and merge this up
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Addresses the second approach to #2006 (via settings dictionary, disabling TF inflation for a comparison level or allowing a comparison level to be the reference level for TF inflation)
Give a brief description for the solution you have provided
Via the settings dictionary, a comparison level can now have two extra boolean flags:
If disable_tf_inflation=True, the TF adjustment code uses the level's u-value rather than seeking the exact-match level's u-value.
If is_tf_inflation_reference_level=True, the TF adjustment code will use this level's u-value for other levels instead of the exact-match level's u-value.
PR Checklist