
[FEAT] Run multiple u training rounds to check stability #1179

Open
RossKen opened this issue Apr 11, 2023 · 3 comments
Assignees: RossKen
Labels: enhancement (New feature or request), model training

Comments

RossKen (Contributor) commented Apr 11, 2023

Is your proposal related to a problem?

This follows on from a discussion in a separate PR that adds a seed to u sampling.

If users train u with too small a sample, the only ways they can tell are by noticing that

  1. some u values are not trained
  2. u values change significantly between runs

In some cases they could get lucky and train all u values, so scenario 1 would not flag the issue at all (though it could still crop up in later runs if the sample misses certain comparison levels).

Describe the solution you'd like

It would be helpful to give estimate_u_using_random_sampling the ability to do multiple runs and compare the u values generated. It may even be worth making multiple runs the default. E.g.

linker.estimate_u_using_random_sampling(max_pairs=5e6, iterations=3)

The u values from each run could then be passed into parameter_estimate_comparisons_chart.

Given that the u values are used when estimating m, an average across the runs would have to be taken before m training. It would also be useful to show this final averaged u value in parameter_estimate_comparisons_chart.

If the final u value is being included in parameter_estimate_comparisons_chart, it would also be useful to show the final m value as well as the individual training sessions (or at least have a parameter allowing it).
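
A rough sketch (not part of the proposal above) of what the iterations= behaviour could look like if approximated by hand today, assuming the seed argument from the linked PR is available; extract_u_probabilities is a hypothetical helper, since the exact API for reading the trained values back out of the linker may vary between versions:

import numpy as np

# Hedged sketch: emulate the proposed iterations= argument by re-running
# u estimation with different seeds and comparing the results.
u_runs = []
for seed in (1, 2, 3):
    linker.estimate_u_using_random_sampling(max_pairs=5e6, seed=seed)
    # Hypothetical extraction step: pull the trained u probabilities out of
    # the model settings (e.g. from a saved settings.json) as a flat dict of
    # {(comparison, level): u_probability}.
    u_runs.append(extract_u_probabilities(linker))

# Check stability across runs and average to get the final u used for m training
for key in u_runs[0]:
    values = [run[key] for run in u_runs]
    print(key, values, "mean:", np.mean(values))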

Describe alternatives you've considered

Additional context

RossKen added the enhancement label on Apr 11, 2023
samnlindsay (Contributor) commented

Summarising this issue as "what if we're (unwittingly) estimating bad u values?", my thoughts are:

  • Provide instant feedback from estimate_u_using_random_sampling
    • If seed provided, produce the relative standard error (RSE) between 0 and 100% and flag u estimates with high RSE (>20%? maybe this threshold itself could be an optional argument)
    • If seed not provided, repeat sampling several times (e.g. iterations=3 above) and observe sampling errors directly
  • Add error bars on u to parameter_estimate_comparisons_chart. Note that the m values estimated by the EM algorithm depend on the one set of u values provided to the model for all of the training sessions. This could be difficult for log axes (see linker.parameter_estimate_comparisons_chart() improvements #1014).
  • As ever, I will point out that it would be useful to be able to train (some) u probabilities using EM rather than relying on direct estimation (equivalent to the Splink v2 fix_u_probabilities option).
Example pandas code showing the RSE for each u value:
from dataengineeringutils3.s3 import read_json_from_s3
import pandas as pd
import numpy as np

pd.options.display.max_rows = 100
pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)

# Load a trained Splink model (settings dict) from S3
path = "s3://alpha-data-linking/v4/model_training/person/dev/probation_delius/basic/2023-03-06/combined_model/settings.json"
model = read_json_from_s3(path)

# Number of pairs used for u estimation
sample_size = 3e8

# Extract the u probability of each comparison level, keyed by comparison column
u = {
    c["output_column_name"]: [
        p["u_probability"] for p in c["comparison_levels"] if "u_probability" in p
    ]
    for c in model["comparisons"]
}

# Reshape to one row per (comparison column, comparison level)
df = pd.DataFrame.from_dict(u, orient="index").reset_index()
df.columns = ["col"] + [f"u{i}" for i in range(df.shape[1] - 1)]
df = pd.wide_to_long(df, stubnames="u", i="col", j="level").reset_index()
df = df.dropna(axis=0).sort_values(["col", "level"]).reset_index(drop=True)

# Percentage relative standard error (binomial standard error relative to the estimate)
df["rse"] = np.sqrt(df.u * (1 - df.u) / sample_size) / df.u * 100

df.style.background_gradient(axis=0, subset=["u", "rse"]).format("{:.1f}", subset="rse")

[Screenshot: styled dataframe of u values and their RSE, as produced by the code above]
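
As a follow-on (not in the original comment), the >20% flagging suggested in the first bullet could be a simple filter on the same dataframe; the threshold is the value floated above and would presumably be a configurable argument:

# Flag u estimates whose relative standard error exceeds the suggested 20% threshold
rse_threshold = 20
unstable_u = df[df["rse"] > rse_threshold]
print(unstable_u[["col", "level", "u", "rse"]])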

RobinL (Member) commented Apr 11, 2023

See also #1060. In particular, by considering the cardinality and skew of columns, you could probably estimate the maximum number of rows needed to ensure a stable estimate of the u values, rather than needing to iterate.
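
Not from the comment itself, but a sketch of what such an estimate might look like, reusing the RSE expression above: given a target RSE and the smallest u value expected across comparison levels (which is what the cardinality and skew of a column ultimately determine), the required number of sampled pairs follows from inverting the RSE formula. required_max_pairs is a hypothetical name:

import numpy as np

def required_max_pairs(smallest_u, target_rse=0.2):
    """Hypothetical helper: number of sampled pairs needed so that the
    relative standard error of the smallest u estimate stays below target_rse.

    Derived by inverting rse = sqrt(u * (1 - u) / n) / u, giving
    n = (1 - u) / (u * rse**2).
    """
    return int(np.ceil((1 - smallest_u) / (smallest_u * target_rse**2)))

# e.g. a comparison level matched by ~1 in 100,000 random pairs, targeting a 20% RSE
print(required_max_pairs(1e-5, 0.2))  # ~2.5 million pairs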

RobinL (Member) commented Apr 11, 2023

@samnlindsay Note you can train u probabilities using EM with this option:

fix_u_probabilities=True,
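
For reference, a hedged sketch of where this flag sits (assuming the Splink 3 estimate_parameters_using_expectation_maximisation signature, which may differ between versions): it defaults to True, which holds u at the directly estimated values, so passing False is what lets an EM session update u as well as m. The blocking rule below is purely illustrative:

# Hedged sketch, assuming the Splink 3 EM training API
linker.estimate_parameters_using_expectation_maximisation(
    "l.date_of_birth = r.date_of_birth",  # illustrative blocking rule
    fix_u_probabilities=False,  # False lets EM update u; True (the default) keeps u fixed
)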

RossKen self-assigned this on Sep 12, 2023