
[FEAT] Run multiple u training rounds to check stability #1179

Open
RossKen opened this issue Apr 11, 2023 · 3 comments
Assignees: RossKen
Labels: enhancement (New feature or request), model training

Comments

RossKen (Contributor) commented Apr 11, 2023

Is your proposal related to a problem?

This follows on from a discussion in a separate PR that adds a seed to u sampling.

If users train u with too small a sample, the only ways they can tell are by noticing that

  1. some u values are not trained
  2. u values change significantly between runs

In some cases they could get lucky and train all u values, so scenario 1 would not flag the issue at all (though it could still crop up in later runs if the sample misses certain comparison levels).

Describe the solution you'd like

It would be helpful to give estimate_u_using_random_sampling the ability to do multiple runs and compare the u values generated. It may even be worth making multiple runs the default. E.g.

linker.estimate_u_using_random_sampling(max_pairs=5e6, iterations=3)

The u values from each run could then be passed into parameter_estimate_comparisons_chart.

Given that the u values are used when estimating m, an average across the runs would have to be taken before m training. It would also be useful to show this final averaged u value in parameter_estimate_comparisons_chart.

If the final u value is being included in parameter_estimate_comparisons_chart, it would also be useful to show the final m value as well as the individual training sessions (or at least have a parameter allowing it).
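
A rough sketch (not part of the proposal above) of what the iterations= behaviour could look like if approximated by hand today, assuming the seed argument from the linked PR is available; extract_u_probabilities is a hypothetical helper, since the exact API for reading the trained values back out of the linker may vary between versions:

import numpy as np

# Hedged sketch: emulate the proposed iterations= argument by re-running
# u estimation with different seeds and comparing the results.
u_runs = []
for seed in (1, 2, 3):
    linker.estimate_u_using_random_sampling(max_pairs=5e6, seed=seed)
    # Hypothetical extraction step: pull the trained u probabilities out of
    # the model settings (e.g. from a saved settings.json) as a flat dict of
    # {(comparison, level): u_probability}.
    u_runs.append(extract_u_probabilities(linker))

# Check stability across runs and average to get the final u used for m training
for key in u_runs[0]:
    values = [run[key] for run in u_runs]
    print(key, values, "mean:", np.mean(values))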

Describe alternatives you've considered

Additional context

RossKen added the enhancement label on Apr 11, 2023
samnlindsay (Contributor) commented

Summarising this issue as "what if we're (unwittingly) estimating bad u values?", my thoughts are:

  • Provide instant feedback from estimate_u_using_random_sampling
    • If seed provided, produce the relative standard error (RSE) between 0 and 100% and flag u estimates with high RSE (>20%? maybe this threshold itself could be an optional argument)
    • If seed not provided, repeat sampling several times (e.g. iterations=3 above) and observe sampling errors directly
  • Add error bars on u to parameter_estimate_comparisons_chart. Note that the m values estimated by the EM algorithm depend on the one set of u values provided to the model for all of the training sessions. This could be difficult for log axes (see linker.parameter_estimate_comparisons_chart() improvements #1014).
  • As ever, I will point out that it would be useful to be able to train (some) u probabilities using EM rather than relying on direct estimation (equivalent to the Splink v2 fix_u_probabilities option).
Example pandas code showing the RSE for each u value:
from dataengineeringutils3.s3 import read_json_from_s3
import pandas as pd
import numpy as np

pd.options.display.max_rows = 100
pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)

# Load a trained Splink model (settings dict) from S3
path = "s3://alpha-data-linking/v4/model_training/person/dev/probation_delius/basic/2023-03-06/combined_model/settings.json"
model = read_json_from_s3(path)

# Number of pairs used for u estimation
sample_size = 3e8

# Extract the u probability of each comparison level, keyed by comparison column
u = {
    c["output_column_name"]: [
        p["u_probability"] for p in c["comparison_levels"] if "u_probability" in p
    ]
    for c in model["comparisons"]
}

# Reshape to one row per (comparison column, comparison level)
df = pd.DataFrame.from_dict(u, orient="index").reset_index()
df.columns = ["col"] + [f"u{i}" for i in range(df.shape[1] - 1)]
df = pd.wide_to_long(df, stubnames="u", i="col", j="level").reset_index()
df = df.dropna(axis=0).sort_values(["col", "level"]).reset_index(drop=True)

# Percentage relative standard error (binomial standard error relative to the estimate)
df["rse"] = np.sqrt(df.u * (1 - df.u) / sample_size) / df.u * 100

df.style.background_gradient(axis=0, subset=["u", "rse"]).format("{:.1f}", subset="rse")

[Screenshot: styled dataframe of u values and their RSE, as produced by the code above]
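
As a follow-on (not in the original comment), the >20% flagging suggested in the first bullet could be a simple filter on the same dataframe; the threshold is the value floated above and would presumably be a configurable argument:

# Flag u estimates whose relative standard error exceeds the suggested 20% threshold
rse_threshold = 20
unstable_u = df[df["rse"] > rse_threshold]
print(unstable_u[["col", "level", "u", "rse"]])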

RobinL (Member) commented Apr 11, 2023

See also #1060. In particular, by considering the cardinality and skew of columns, you could probably estimate the maximum number of rows needed to ensure a stable estimate of the u values, rather than needing to iterate.
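
Not from the comment itself, but a sketch of what such an estimate might look like, reusing the RSE expression above: given a target RSE and the smallest u value expected across comparison levels (which is what the cardinality and skew of a column ultimately determine), the required number of sampled pairs follows from inverting the RSE formula. required_max_pairs is a hypothetical name:

import numpy as np

def required_max_pairs(smallest_u, target_rse=0.2):
    """Hypothetical helper: number of sampled pairs needed so that the
    relative standard error of the smallest u estimate stays below target_rse.

    Derived by inverting rse = sqrt(u * (1 - u) / n) / u, giving
    n = (1 - u) / (u * rse**2).
    """
    return int(np.ceil((1 - smallest_u) / (smallest_u * target_rse**2)))

# e.g. a comparison level matched by ~1 in 100,000 random pairs, targeting a 20% RSE
print(required_max_pairs(1e-5, 0.2))  # ~2.5 million pairs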

RobinL (Member) commented Apr 11, 2023

@samnlindsay Note you can train u probabilities using EM with this option:

fix_u_probabilities=True,
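
For reference, a hedged sketch of where this flag sits (assuming the Splink 3 estimate_parameters_using_expectation_maximisation signature, which may differ between versions): it defaults to True, which holds u at the directly estimated values, so passing False is what lets an EM session update u as well as m. The blocking rule below is purely illustrative:

# Hedged sketch, assuming the Splink 3 EM training API
linker.estimate_parameters_using_expectation_maximisation(
    "l.date_of_birth = r.date_of_birth",  # illustrative blocking rule
    fix_u_probabilities=False,  # False lets EM update u; True (the default) keeps u fixed
)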

RossKen self-assigned this on Sep 12, 2023