[FEAT] Make `estimate_u_using_random_sampling` reproducible #1155

samnlindsay · 2023-03-30T11:58:18Z

Is your proposal related to a problem?

Models aren't currently fully reproducible because of `estimate_u_using_random_sampling`: the u values can be estimated differently on each run.

Describe the solution you'd like

There should be a seed option to ensure the same data and code can reproduce the same sample from which to estimate u.
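To make the request concrete, here is a minimal standalone sketch in plain Python of how a seed makes a random-pair estimate of u repeatable. It is not splink's implementation: the function `estimate_u`, the toy `names` list, and the exact-match comparison are all hypothetical and chosen only for illustration.

```python
import random

def estimate_u(values, n_pairs, seed=None):
    """Estimate u for an exact-match comparison from randomly drawn record pairs.

    Hypothetical helper for illustration only; it is not part of splink.
    Random pairs are almost all non-matches, so the share of pairs that
    agree approximates the u probability.
    """
    rng = random.Random(seed)  # private generator: same seed, same draws
    agree = 0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(values)), 2)  # two distinct records
        agree += values[i] == values[j]
    return agree / n_pairs

# Toy data (assumed): a single column of first names.
names = ["amy", "bob", "cat", "dan", "eve", "amy", "bob", "amy"] * 25

seeded_a = estimate_u(names, 200, seed=0)
seeded_b = estimate_u(names, 200, seed=0)  # identical to seeded_a
unseeded = estimate_u(names, 200)          # may differ from run to run
```

With a fixed seed the two calls draw exactly the same pairs, so the estimates agree. Without one, each run draws a fresh sample.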

Additional context

I think this might be a fairly critical point of failure for splink, seeing as it is the only means of generating u values (I don't believe u can be estimated using EM anymore? Previous versions estimated m, u and lambda using EM, with an option to fix some initial values if desired). Given that, I expect there will be use cases where a suitable sample size is far from clear. Estimates of u based on very small subsamples could be highly variable, and their accuracy is currently unknown and untested. A single random sample is insufficient for me to be fully confident in the u values being used.
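The concern about small subsamples can be illustrated with a rough simulation. The sketch below is plain Python under stated assumptions, not splink code: `estimate_u`, `spread`, and the toy `values` column are hypothetical. It repeats the estimate over 20 seeds and compares how far the estimates range for a small and a large number of sampled pairs.

```python
import random

def estimate_u(values, n_pairs, seed):
    """Share of randomly drawn record pairs that agree exactly (hypothetical helper)."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(values)), 2)  # two distinct records
        agree += values[i] == values[j]
    return agree / n_pairs

def spread(values, n_pairs, seeds):
    """Range (max minus min) of the u estimates across the given seeds."""
    estimates = [estimate_u(values, n_pairs, s) for s in seeds]
    return max(estimates) - min(estimates)

# Toy column (assumed): 10 equally common values, so the true u is close to 0.1.
values = [f"name_{k % 10}" for k in range(1000)]
seeds = range(20)

small_spread = spread(values, 50, seeds)    # very small sample of pairs
large_spread = spread(values, 5000, seeds)  # larger sample of pairs
print(f"spread with 50 pairs:   {small_spread:.3f}")
print(f"spread with 5000 pairs: {large_spread:.3f}")
```

Repeating the estimate over several seeds, as here, is one way to judge whether a chosen sample size is large enough for the u values to be stable.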


RossKen · 2023-03-30T13:10:59Z

Good point - I hadn't considered this before. As you say, some sort of seed option should theoretically solve this, but I will think about it further.

RossKen · 2023-03-31T09:33:53Z

FYI @samnlindsay I have got this working for DuckDB in #1161, but Spark SQL doesn't appear to support seeds in the TABLESAMPLE function that we use, so that will take a bit more digging around.
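Where a backend's sampling clause cannot take a seed, one generic alternative is to derive the sample from a hash of each row's unique ID combined with the seed. The sketch below shows that idea in plain Python. It is a swapped-in technique for illustration only, not what #1161 or splink does, and `keep_row` is a hypothetical helper.

```python
import hashlib

def keep_row(unique_id, seed, fraction):
    """Return True if this row falls inside the sample (hypothetical helper)."""
    digest = hashlib.sha256(f"{seed}:{unique_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < fraction

ids = range(10_000)
sample_a = [i for i in ids if keep_row(i, seed=42, fraction=0.1)]
sample_b = [i for i in ids if keep_row(i, seed=42, fraction=0.1)]  # same rows as sample_a
sample_c = [i for i in ids if keep_row(i, seed=7, fraction=0.1)]   # a different sample
```

Because membership depends only on the ID and the seed, the sample does not depend on row order or on how the data is partitioned. The trade-off is that the sample size is only approximately `fraction` of the table.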

James-Osmond · 2023-06-19T15:28:20Z

@RossKen We are using splink v3.9.1 with the DuckDB backend, but the results from `estimate_u_using_random_sampling()` with a seed are still not the same from run to run. The $u$-probabilities change every time we run exactly the same code. This is despite the work from pull request #1161.

The below image is taken from a run with a seed of 0. We see similar results to this with other seeds.

James-Osmond · 2023-09-26T14:25:22Z

Hi @RossKen, did this issue ever get looked at?

samnlindsay · 2023-09-26T15:19:28Z

@James-Osmond Yes, it was closed by #1161

samnlindsay added the enhancement and model training labels Mar 30, 2023

RossKen mentioned this issue Mar 31, 2023

Add option to pass seed into estimate_u_using_random_sampling #1161

Merged

RossKen self-assigned this Mar 31, 2023

RossKen closed this as completed in #1161 Apr 12, 2023
