Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Make estimate_u_using_random_sampling reproducible #1155

Closed
samnlindsay opened this issue Mar 30, 2023 · 5 comments · Fixed by #1161
Closed

[FEAT] Make estimate_u_using_random_sampling reproducible #1155

samnlindsay opened this issue Mar 30, 2023 · 5 comments · Fixed by #1161
Assignees
Labels
enhancement New feature or request model training

Comments

@samnlindsay
Copy link
Contributor

Is your proposal related to a problem?

Models aren't currently fully reproducible due to estimate_u_using_random_sampling - the u values can be estimated differently each time.

Describe the solution you'd like

There should be a seed option to ensure the same data and code can reproduce the same sample from which to estimate u.

Additional context

I think this might be a fairly critical point of failure for splink, seeing as it is the only means of generating u values (I don't believe u can be estimated using EM anymore? Previous versions estimated m, u and lambda using EM, with an option to fix some initial values if desired). Given that, I expect there will be use cases where a suitable sample size is far from clear. U values based on very small subsamples could be highly variable and their accuracy is currently unknown and untested. A single random sample is insufficient for me to be fully confident in the u values being used.

@samnlindsay samnlindsay added enhancement New feature or request model training labels Mar 30, 2023
@RossKen
Copy link
Contributor

RossKen commented Mar 30, 2023

Good point - I hadn't considered this before. As you say, some sort of seed option should theoretically solve this, but I will think about it further

@RossKen
Copy link
Contributor

RossKen commented Mar 31, 2023

FYI @samnlindsay I have got this working for DuckDB in #1161 but spark SQL doesn't appear to have support for seeds in the TABLESAMPLE function that we use so will take a bit more digging around.

@RossKen RossKen self-assigned this Mar 31, 2023
@James-Osmond
Copy link
Contributor

@RossKen We are using splink v3.9.1 with the DuckDB backend but the results from estimate_u_using_random_sampling() with a seed are still not the same from run to run. The $u$-probabilities change every time we run exactly the same code. This is despite the work from pull request #1161

The below image is taken from a run with a seed of 0. We see similar results to this with other seeds.

MicrosoftTeams-image

@James-Osmond
Copy link
Contributor

Hi @RossKen, did this issue ever get looked at?

@samnlindsay
Copy link
Contributor Author

@James-Osmond Yes, it was closed by #1161

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request model training
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants