Add option to pass seed into estimate_u_using_random_sampling
#1161
Test: test_2_rounds_1k_duckdb; Percentage change: -28.9%
Test: test_2_rounds_1k_sqlite; Percentage change: -23.6%
Lint fails on |
In discussion with @samnlindsay: should a seed be set by default? This would make all Splink models fully reproducible without the user having to think about it. One downside of setting seeds by default is that users may get a false sense of security about the accuracy of their trained u values if they come out the same every time. Currently, if users estimate u from a sample that is too small, u varies significantly between runs, which prompts the user to create a bigger sample (if they understand what is going on under the hood). This could be flagged to users more explicitly in a warning message, or by doing multiple estimates with different seeds to generate multiple u values that can be viewed in the
EDIT: issue opened at #1179
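To make the small-sample point concrete, here is a minimal sketch in plain Python. It is not Splink's implementation: `estimate_u`, the toy `values` column, and the pair counts are all hypothetical, and it assumes that randomly drawn pairs can be treated as non-matches. It estimates u as the share of random pairs that agree on a column, then compares how much the estimate moves across seeds for a small and a large sample.

```python
import random


def estimate_u(values, n_pairs, seed):
    """Share of randomly sampled record pairs that agree on the column.

    Hypothetical helper: random pairs are assumed to be non-matches.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_pairs):
        # Draw two distinct record positions
        a, b = rng.sample(range(len(values)), 2)
        if values[a] == values[b]:
            agree += 1
    return agree / n_pairs


# Toy column with 10 roughly equally likely values, so u is close to 0.1
data_rng = random.Random(0)
values = [data_rng.randrange(10) for _ in range(2000)]


def spread(n_pairs, seeds):
    """Range (max - min) of the u estimates across the given seeds."""
    estimates = [estimate_u(values, n_pairs, s) for s in seeds]
    return max(estimates) - min(estimates)


seeds = range(10)
spread_small = spread(50, seeds)      # small sample: estimates move a lot
spread_large = spread(20000, seeds)   # large sample: estimates are stable

print(f"spread with 50 pairs:    {spread_small:.4f}")
print(f"spread with 20000 pairs: {spread_large:.4f}")
```

With a fixed seed, the 50-pair estimate would look perfectly stable from run to run, which is the false sense of security described above. Varying the seed exposes the spread.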
Well done for sorting out the spark seed stuff! |
If there's somewhere sensible, might be a good idea to document that the Spark approach will degrade performance due to the extra sort. |
Thanks both! I have made fixes and added an explanation of Spark's decreased performance when a seed is set. I think this is ready to look at again.
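As a rough illustration of why a seeded sample might need the extra sort, here is a plain-Python sketch. It does not use Spark, and `seeded_sample` is a hypothetical helper. The assumption is that a seed only gives a repeatable sample when the rows arrive in a fixed order, so imposing that order first buys reproducibility at the cost of a sort.

```python
import random


def seeded_sample(rows, k, seed, sort_first):
    """Pick k rows with a seeded RNG, optionally fixing the row order first."""
    # Sorting costs O(n log n) but makes the input order irrelevant
    ordered = sorted(rows) if sort_first else list(rows)
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, k))


rows = list(range(1000))

# Same rows in a different arrival order, standing in for an engine
# that does not guarantee row order between runs
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

with_sort_a = seeded_sample(rows, 5, seed=42, sort_first=True)
with_sort_b = seeded_sample(shuffled, 5, seed=42, sort_first=True)
no_sort_a = seeded_sample(rows, 5, seed=42, sort_first=False)
no_sort_b = seeded_sample(shuffled, 5, seed=42, sort_first=False)

print("sorted first, same sample:", with_sort_a == with_sort_b)
print("unsorted, same sample:    ", no_sort_a == no_sort_b)
```

Without the sort, the same seed selects the same positions but not the same rows, so the sample changes whenever the arrival order does.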
@ADBond FYI, from this PR you can see that your latest fix for autoblack has now worked 🎉 One thing: once the "lint with black" commit has gone in, the tests aren't triggered on the new commit. I'm pretty sure that happened before, right?
Thanks Ross, LGTM
On your second point - unfortunately not. I guess ideally we'd have |
We could also now add an autolinter as ruff comes with the |
Yeah that's right - any commit that is pushed from a github workflow will not trigger other workflows triggered by push/pull request etc. This is so that you don't end up in a situation where you have an infinite loop of workflows. I think we can work around that by creating and using a personal-access-token to push, instead of the default |
Closes #1155
Intended to enhance the reproducibility of models.