Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to pass seed into estimate_u_using_random_sampling #1161

Merged
merged 18 commits into from
Apr 12, 2023

Conversation

RossKen
Copy link
Contributor

@RossKen RossKen commented Mar 31, 2023

Closes #1155

Intended to enhance the reproducibility of models.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 31, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -28.9%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1554 2023-04-11 17:20:14 1.35759 1.33335 (detached head) 5517e52 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 5517e52

Test: test_2_rounds_1k_sqlite

Percentage change: -23.6%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1556 2023-04-11 17:20:14 3.25733 3.2557 (detached head) 5517e52 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 5517e52

Click here for vega lite time series charts

@RossKen RossKen marked this pull request as ready for review April 10, 2023 09:30
@RossKen
Copy link
Contributor Author

RossKen commented Apr 10, 2023

Lint fails on import itertools in misc.py, saying it is not used. However, it looks to be used in the all_letter_combos function at line 84.

@RossKen
Copy link
Contributor Author

RossKen commented Apr 11, 2023

In discussion with @samnlindsay - should a seed be set by default? This would mean that all splink models are fully reproducible without the user having to think about it.
Would have to rework the code slightly as Athena and SQLite currently error out with a seed.

One con of setting seeds by default is that users will (potentially) get a false sense of security over the accuracy of their trained u values if they are the same every time. Currently, if users are generating u from a sample that is too small, u will vary significantly which will prompt the user to create a bigger sample (if they understand what is going on under the hood). This feels like it can be flagged to users more explicitly in a warning message, or by doing multiple estimates with different seeds to generate multiple u values that can be viewed in the parameter_estimate_comparisons_chart, I will raise a separate issue for this. For now, a varying u is the only place that a user will see the implications of too small a sample, so perhaps worth leaving the default without a seed until that has been resolved.

EDIT: issue opened at #1179

@ThomasHepworth
Copy link
Contributor

Well done for sorting out the spark seed stuff!

@RobinL
Copy link
Member

RobinL commented Apr 11, 2023

If there's somewhere sensible, might be a good idea to document that the Spark approach will degrade performance due to the extra sort.

@RossKen
Copy link
Contributor Author

RossKen commented Apr 11, 2023

Thanks both!

I have made fixes and added an explanation for spark's decreased performance when a seed is set. I think this is ready to look at again

@RossKen
Copy link
Contributor Author

RossKen commented Apr 11, 2023

@ADBond FYI from this PR you can see that your latest fix for autoblack has now worked 🎉

One thing is once the "lint with black" commit has gone in the tests aren't triggered on the new commit. I'm pretty sure that happened before, right?

Copy link
Contributor

@ThomasHepworth ThomasHepworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ross, LGTM

@ThomasHepworth
Copy link
Contributor

#1161 (comment)

On your second point - unfortunately not. I guess ideally we'd have lint with black trigger first and then the remaining tests run once that has completed.

@ThomasHepworth
Copy link
Contributor

We could also now add an autolinter as ruff comes with the --fix argument.

@RossKen RossKen merged commit 86e2b94 into master Apr 12, 2023
@ADBond
Copy link
Contributor

ADBond commented Apr 13, 2023

One thing is once the "lint with black" commit has gone in the tests aren't triggered on the new commit. I'm pretty sure that happened before, right?

Yeah that's right - any commit that is pushed from a github workflow will not trigger other workflows triggered by push/pull request etc. This is so that you don't end up in a situation where you have an infinite loop of workflows. I think we can work around that by creating and using a personal-access-token to push, instead of the default GITHUB_TOKEN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] Make estimate_u_using_random_sampling reproducible
4 participants