Is your proposal related to a problem?
Models aren't currently fully reproducible because of estimate_u_using_random_sampling: the u values can be estimated differently each time.
Describe the solution you'd like
There should be a seed option to ensure the same data and code can reproduce the same sample from which to estimate u.
Additional context
I think this might be a fairly critical point of failure for splink, since random sampling is the only means of generating u values (I don't believe u can be estimated using EM anymore? Previous versions estimated m, u and lambda using EM, with an option to fix some initial values if desired). Given that, I expect there will be use cases where a suitable sample size is far from clear. u values based on very small subsamples could be highly variable, and their accuracy is currently unknown and untested. A single random sample is not enough for me to be fully confident in the u values being used.
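As a rough illustration of both points (seeding, and sensitivity to sample size), here is a minimal standard-library Python sketch. It does not use splink and is not how splink samples internally: `estimate_u`, the toy `first_names` column and the pair counts are all hypothetical, and random pairs are simply assumed to be non-matches.

```python
import random


def estimate_u(values, n_pairs, seed=None):
    """Estimate u for one exact-match comparison.

    u is approximated as the share of randomly sampled record pairs whose
    values agree. Treating random pairs as non-matches is an assumption.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    agree = 0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(values)), 2)  # two distinct record indices
        agree += values[i] == values[j]
    return agree / n_pairs


# Toy column (hypothetical data): 1,000 records spread over 20 made-up names.
data_rng = random.Random(123)
first_names = [f"name_{data_rng.randrange(20)}" for _ in range(1000)]

# Same data, same code, same seed: the estimate is reproduced exactly.
first_run = estimate_u(first_names, n_pairs=200, seed=0)
second_run = estimate_u(first_names, n_pairs=200, seed=0)
same_seed = first_run == second_run

# Re-estimating with different seeds shows how much u moves at each sample size.
small = [estimate_u(first_names, n_pairs=50, seed=s) for s in range(20)]
large = [estimate_u(first_names, n_pairs=5000, seed=s) for s in range(20)]

print("same seed reproduces estimate:", same_seed)
print("range of u over 20 seeds,   50 pairs:", max(small) - min(small))
print("range of u over 20 seeds, 5000 pairs:", max(large) - min(large))
```

Looping over several seeds like this is one way to check whether a chosen sample size gives stable u values, rather than relying on a single random sample.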
FYI @samnlindsay I have got this working for DuckDB in #1161, but Spark SQL doesn't appear to support seeds in the TABLESAMPLE function that we use, so that will take a bit more digging around.
@RossKen We are using splink v3.9.1 with the DuckDB backend, but the results from estimate_u_using_random_sampling() with a seed are still not the same from run to run. The $u$-probabilities change every time we run exactly the same code, despite the work in pull request #1161.
The image below is taken from a run with a seed of 0. We see similar results with other seeds.