Remove broken EM training options #2272

ADBond · 2024-07-19T09:25:44Z

Remove two (related) options from Linker.estimate_parameters_using_expectation_maximisation().

Neither comparisons_to_deactivate nor comparison_levels_to_reverse_blocking_rule makes sense the way things are currently set up. Users do not interact with Comparison and ComparisonLevel objects directly, so there is nothing sensible they could pass to the function for these options.

The functionality could be restored in future by allowing for a sensible way for the user to refer unambiguously to comparisons + comparison levels, such as assigning each of these a unique name - see linked issue.

Closes #2016.

these do not make sense with the current setup, as there is no reasonable thing a user would pass to this function. We could restore these once we have a way for users to sensibly refer to comparisons + comparison levels

RobinL

Thanks!

lamaeldo · 2024-10-10T15:24:36Z

Does this mean there is currently no way in Splink 4 to intentionally deactivate the estimation of m for a specific column in an EM training session? Say that, for an EM training session, I block on a column containing the initial of a surname (instead of using substr(sname,1,1) in the blocking rule, which I assume would be quite slow), and that column is not explicitly referenced in my surname comparison. Will the m values for surname then be trained regardless of the bias this induces?

RobinL · 2024-10-10T19:52:40Z

@lamaeldo I think that's probably right. We probably overlooked it because we've never used it in practice, but I do agree there are some legitimate use cases.

There is this: #2379, which doesn't do exactly what you want but may help.

I think you could even set it prior to the EM training run and unset it afterwards, to get the behaviour you want.

But you'd have to try. To set it you'd have to use private methods - the following code executes without error but I haven't verified it actually produces the behaviour you want:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
)

linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

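# Flag every level of the first_name comparison as having a fixed m probability.
# This is intended to stop EM updating them; it relies on private attributes.
first_name_comparison = linker._settings_obj._get_comparison_by_output_column_name(
    "first_name"
)
for comparison_level in first_name_comparison.comparison_levels:
    comparison_level._fix_m_probability = True


# Run EM while blocking on a prefix of first_name.
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("substr(first_name, 1, 2)")
)

# Clear the flag again so later training sessions can estimate these levels.
for comparison_level in first_name_comparison.comparison_levels:
    comparison_level._fix_m_probability = False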


pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

clusters.as_duckdbpyrelation()
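The set-then-unset pattern above leaves the levels fixed if the EM run raises part-way through. A minimal sketch of one way to guard against that is a context manager that restores the flags on exit. The helper name fixed_m_probabilities is hypothetical and not part of Splink; it only assumes an object with a comparison_levels list whose items carry the private _fix_m_probability attribute used above. Stand-in objects are used here so the sketch runs without Splink installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def fixed_m_probabilities(comparison):
    """Temporarily set _fix_m_probability on every level of `comparison`.

    Hypothetical helper (not a Splink API). It assumes `comparison` exposes
    a `comparison_levels` iterable, as the comparison object in the snippet
    above does.
    """
    levels = list(comparison.comparison_levels)
    # Remember the previous values so levels that were already fixed stay fixed.
    previous = [getattr(level, "_fix_m_probability", False) for level in levels]
    for level in levels:
        level._fix_m_probability = True
    try:
        yield comparison
    finally:
        # Restore the flags even if the body of the with-block raised.
        for level, old_value in zip(levels, previous):
            level._fix_m_probability = old_value


# Stand-in for a Splink comparison with three levels (illustration only).
fake_comparison = SimpleNamespace(
    comparison_levels=[SimpleNamespace(_fix_m_probability=False) for _ in range(3)]
)

with fixed_m_probabilities(fake_comparison):
    # With a real linker, the EM training call would go here.
    flags_during = [
        level._fix_m_probability for level in fake_comparison.comparison_levels
    ]

flags_after = [
    level._fix_m_probability for level in fake_comparison.comparison_levels
]
```

With a real linker, the object passed in would be first_name_comparison from the snippet above, and the call to estimate_parameters_using_expectation_maximisation would sit inside the with-block. Whether the flag actually produces the desired training behaviour is still unverified, as noted above.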

remove EM training options

b28e88e


ADBond added Interface/API improvement model training splink4 labels Jul 19, 2024

ADBond requested a review from RobinL July 19, 2024 09:25

RobinL approved these changes Jul 19, 2024


ADBond merged commit 39b849b into splink4_dev Jul 19, 2024
25 checks passed

ADBond deleted the em-training-remove-deactivation-reverse branch July 19, 2024 12:05

ADBond mentioned this pull request Jul 25, 2024

Can't sensibly supply comparison_levels_to_reverse_blocking_rule in Splink 4 #2016

Closed
