Remove broken EM training options #2272

Merged
merged 1 commit into from
Jul 19, 2024

ADBond
Contributor

@ADBond ADBond commented Jul 19, 2024

Remove two (related) options from Linker.estimate_parameters_using_expectation_maximisation().

Neither comparisons_to_deactivate nor comparison_levels_to_reverse_blocking_rule makes sense with the current setup. Users do not interact with Comparisons and ComparisonLevels directly, so there is nothing sensible they could pass to the function for these options.

The functionality could be restored in future by allowing for a sensible way for the user to refer unambiguously to comparisons + comparison levels, such as assigning each of these a unique name - see linked issue.

Closes #2016.

these do not make sense with the current setup, as there is no reasonable thing a user would pass to this function. We could restore these once we have a way for users to sensibly refer to comparisons + comparison levels
Member

@RobinL RobinL left a comment

Thanks!

@ADBond ADBond merged commit 39b849b into splink4_dev Jul 19, 2024
25 checks passed
@ADBond ADBond deleted the em-training-remove-deactivation-reverse branch July 19, 2024 12:05
@lamaeldo

Does this mean there is currently no way in Splink 4 to intentionally deactivate the estimation of m for a specific column in an EM training session? Say that, for an EM training session, I block on a column that contains the initial of a surname (instead of using substr(sname,1,1) in the blocking rule, which I assume would be quite slow), but that column is not explicitly referenced in my surname comparison. Will the m value for surname then have to be trained regardless of the bias induced?
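To make the bias in question concrete, here is a minimal sketch in plain Python (it does not use Splink). It assumes a simplified model in which true matches agree on surname with probability m, and pairs whose surnames disagree still share the same initial with probability q. The function name `blocked_m` and the values of m and q are illustrative assumptions, not anything taken from Splink.

```python
def blocked_m(m: float, q: float) -> float:
    """P(surname agrees | pair is a true match AND passes the blocking rule).

    m: true P(surname agrees | match)
    q: assumed P(initials agree | match, surnames disagree)
    """
    # A pair with agreeing surnames always shares the initial, so it always
    # passes a blocking rule on the initial; a disagreeing pair passes with
    # probability q.
    passes_and_agrees = m
    passes_and_disagrees = (1.0 - m) * q
    return passes_and_agrees / (passes_and_agrees + passes_and_disagrees)


true_m = 0.9  # illustrative value only
for q in (1.0, 0.5, 0.1):
    print(f"q={q}: surname agreement rate among blocked matches = {blocked_m(true_m, q):.4f}")
```

Under these assumptions, the agreement rate seen within the blocked pairs is at least m and grows as q shrinks, which is why training m for surname while blocking on a column derived from surname would tend to overestimate it.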

@RobinL
Member

RobinL commented Oct 10, 2024

@lamaeldo I think that's probably right. We probably overlooked it because we've never used it in practice, but I agree there are some legitimate use cases.

There is this:
#2379

Which doesn't do exactly what you want but may help.

I think you could even set it prior to the EM training run and unset it afterwards, to get the behaviour you want.

But you'd have to try. To set it you'd have to use private methods - the following code executes without error but I haven't verified it actually produces the behaviour you want:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
)

linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

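# Temporarily fix the m probabilities of every level of the first_name comparison.
# Note: _settings_obj, _get_comparison_by_output_column_name and
# _fix_m_probability are private and may change between releases.
first_name_comparison = linker._settings_obj._get_comparison_by_output_column_name("first_name")
for comparison_level in first_name_comparison.comparison_levels:
    comparison_level._fix_m_probability = True

# Run EM training; the intent is that the first_name m values stay fixed here
linker.training.estimate_parameters_using_expectation_maximisation(block_on("substr(first_name, 1, 2)"))

# Unset the flag so later training sessions can estimate first_name m values again
for comparison_level in first_name_comparison.comparison_levels:
    comparison_level._fix_m_probability = False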


pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

clusters.as_duckdbpyrelation()
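One way to make the set-then-unset pattern above more robust is to wrap it in a context manager, so the flag is restored even if training raises. The sketch below is generic Python, not part of Splink: the helper name `temporarily_set` is hypothetical, and the `SimpleNamespace` objects are stand-ins for comparison levels so that the example runs without Splink installed. It also assumes the attribute already exists on each object.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(objects, attr, value):
    """Set `attr` to `value` on every object, restoring the old values on exit."""
    objects = list(objects)
    # Remember the previous values (assumes the attribute already exists)
    previous = [getattr(obj, attr) for obj in objects]
    try:
        for obj in objects:
            setattr(obj, attr, value)
        yield
    finally:
        # Runs on normal exit and when the body raises
        for obj, old in zip(objects, previous):
            setattr(obj, attr, old)


# Stand-ins for first_name_comparison.comparison_levels
levels = [SimpleNamespace(_fix_m_probability=False) for _ in range(3)]

with temporarily_set(levels, "_fix_m_probability", True):
    # In real use, the EM training call would go here
    during = [level._fix_m_probability for level in levels]

after = [level._fix_m_probability for level in levels]
print(during, after)
```

With a real linker, the same helper could presumably be applied to `first_name_comparison.comparison_levels` around the EM training call, though, as with the snippet above, that has not been verified to produce the desired behaviour.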
