Add option to pass seed into `estimate_u_using_random_sampling` #1161

RossKen · 2023-03-31T09:29:38Z

Intended to enhance the reproducibility of models.

github-actions · 2023-03-31T14:31:12Z

Test: test_2_rounds_1k_duckdb

Percentage change: -28.9%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1554 | 2023-04-11 | 17:20:14 | 1.35759 | 1.33335 | (detached head) | `5517e52` | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | `5517e52` |

Test: test_2_rounds_1k_sqlite

Percentage change: -23.6%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1556 | 2023-04-11 | 17:20:14 | 3.25733 | 3.2557 | (detached head) | `5517e52` | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | `5517e52` |

RossKen · 2023-04-10T09:33:17Z

Lint fails on `import itertools` in `misc.py`, saying it is unused. However, it appears to be used in the `all_letter_combos` function at line 84.

RossKen · 2023-04-11T09:59:55Z

In discussion with @samnlindsay: should a seed be set by default? This would make all Splink models fully reproducible without the user having to think about it.
We would have to rework the code slightly, as Athena and SQLite currently error out with a seed.
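
To make the backend limitation concrete, here is a minimal sketch of the kind of guard that rejects a seed on backends that cannot honour one. The function name, the backend labels, and the error message are hypothetical and are not taken from the Splink codebase; only the fact that Athena and SQLite error out with a seed comes from the comment above.

```python
# Hypothetical backend labels; which backends reject a seed is taken from
# the discussion above, not from the library's source.
BACKENDS_WITHOUT_SEED_SUPPORT = {"athena", "sqlite"}


def check_seed_supported(backend, seed=None):
    """Raise early if a seed is requested on a backend that cannot honour it."""
    if seed is not None and backend in BACKENDS_WITHOUT_SEED_SUPPORT:
        raise NotImplementedError(
            f"Random sampling with a seed is not supported for {backend}; "
            "pass seed=None to sample without one."
        )


check_seed_supported("duckdb", seed=42)  # accepted, nothing is raised
check_seed_supported("sqlite")           # no seed requested, nothing is raised
```

With a guard like this, a default seed would break the two unsupported backends unless the default were resolved per backend, which is the rework mentioned above.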

One con of setting seeds by default is that users could get a false sense of security about the accuracy of their trained u values if they are the same every time. Currently, if users generate u from a sample that is too small, u will vary significantly between runs, which prompts the user to create a bigger sample (if they understand what is going on under the hood). This could be flagged to users more explicitly in a warning message, or by running multiple estimates with different seeds to generate multiple u values that can be viewed in the `parameter_estimate_comparisons_chart`. I will raise a separate issue for this. For now, a varying u is the only place where a user will see the implications of too small a sample, so it is perhaps worth leaving the default without a seed until that has been resolved.

EDIT: issue opened at #1179
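
The effect described in this comment can be illustrated with a small standalone simulation. This is only a sketch using the standard library, not Splink's implementation: `estimate_u`, the synthetic names, and the sample sizes are all assumptions made for illustration. It estimates the chance that two randomly paired records agree on a field, once per seed, and prints how much the estimates spread for a small and a large number of sampled pairs.

```python
import random
import statistics


def estimate_u(records, n_pairs, seed=None):
    """Estimate the chance that two randomly paired records agree on a field."""
    rng = random.Random(seed)  # seed=None draws a different sample each run
    agreements = 0
    for _ in range(n_pairs):
        a, b = rng.sample(records, 2)  # two distinct records, chosen at random
        if a == b:
            agreements += 1
    return agreements / n_pairs


# Hypothetical data: 1,000 records spread over 20 first names.
data_rng = random.Random(0)
first_names = [f"name_{data_rng.randrange(20)}" for _ in range(1000)]

for n_pairs in (50, 20_000):
    # One estimate per seed; the spread across seeds shows sample stability.
    estimates = [estimate_u(first_names, n_pairs, seed=s) for s in range(5)]
    print(n_pairs, round(statistics.pstdev(estimates), 4))
```

A fixed seed makes any single estimate repeatable, but it is the spread across seeds that reveals whether the sample is large enough, which is the motivation for running several seeded estimates.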

tests/test_u_train.py

ThomasHepworth · 2023-04-11T12:26:37Z

Well done for sorting out the spark seed stuff!

RobinL · 2023-04-11T12:59:02Z

If there's somewhere sensible, might be a good idea to document that the Spark approach will degrade performance due to the extra sort.
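
As a rough illustration of where the extra sort comes from, the sketch below builds the two kinds of sampling clause for Spark SQL. The function name and signature are hypothetical and this is not presented as the code in the PR; it assumes only that the seeded path uses "order by rand and limit", as the commit messages suggest, and that the unseeded path can use `TABLESAMPLE`.

```python
def spark_sample_clause(proportion, sample_size, seed=None):
    """Return a SQL fragment that samples rows from a Spark table (illustrative)."""
    if proportion >= 1.0:
        return ""  # nothing to sample, use the full table
    if seed is None:
        # Unseeded: rows are sampled without a global sort.
        return f" TABLESAMPLE ({proportion * 100:.6g} PERCENT)"
    # Seeded: each row gets a random key derived from the seed and the whole
    # table is sorted on that key before the limit is applied. That sort is
    # the performance cost referred to above.
    return f" ORDER BY rand({int(seed)}) LIMIT {round(sample_size)}"


print("SELECT * FROM df" + spark_sample_clause(0.25, 1000))
print("SELECT * FROM df" + spark_sample_clause(0.25, 1000, seed=7))
```

The trade-off is that the seeded form is intended to be repeatable, at the price of sorting every row.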

RossKen · 2023-04-11T17:25:24Z

Thanks both!

I have made the fixes and added an explanation of Spark's decreased performance when a seed is set. I think this is ready to look at again.

splink/athena/athena_linker.py

RossKen · 2023-04-11T20:25:16Z

@ADBond FYI from this PR you can see that your latest fix for autoblack has now worked 🎉

One thing: once the "lint with black" commit has gone in, the tests aren't triggered on the new commit. I'm pretty sure that happened before, right?

ThomasHepworth

Thanks Ross, LGTM

ThomasHepworth · 2023-04-12T14:39:08Z

#1161 (comment)

On your second point: unfortunately not. Ideally, we'd have lint with black run first, and then the remaining tests run once that has completed.

ThomasHepworth · 2023-04-12T14:42:53Z

We could also now add an autolinter, as ruff comes with the `--fix` argument.

ADBond · 2023-04-13T08:12:19Z

> One thing is once the "lint with black" commit has gone in the tests aren't triggered on the new commit. I'm pretty sure that happened before, right?

Yeah, that's right. Any commit that is pushed from a GitHub workflow will not trigger other workflows that run on push, pull request, etc. This prevents an infinite loop of workflows. I think we can work around that by creating and using a personal access token to push, instead of the default `GITHUB_TOKEN`.

RossKen added 2 commits March 31, 2023 09:06

fix docs path link break

5d61ca1

Working for duckdb

68be1b6

RossKen marked this pull request as draft March 31, 2023 09:29

RossKen mentioned this pull request Mar 31, 2023

[FEAT] Make estimate_u_using_random_sampling reproducible #1155

Closed

RossKen added 3 commits March 31, 2023 10:47

linting

1d6b1c0

add seed parameter to all backend integration test

61286b0

add exception handling to athena and sqlite

f149d54

RossKen added 9 commits March 31, 2023 15:37

spark not working

7cf0e12

docs

bfacc59

test repeatable for spark

49cd395

test order by rand and limit

9f5f164

lint

6d1ba01

fix typo

d785f56

remove unused import

50de05f

re-simplify functions

6bce59a

readd omitted seed parameter

e3532c8

RossKen marked this pull request as ready for review April 10, 2023 09:30

RossKen requested review from ThomasHepworth and samnlindsay April 10, 2023 09:34

RossKen mentioned this pull request Apr 11, 2023

[FEAT] Run multiple u training rounds to check stability #1179

Open

ThomasHepworth reviewed Apr 11, 2023

tests/test_u_train.py (outdated, resolved)

ThomasHepworth reviewed Apr 11, 2023

tests/test_u_train.py (outdated, resolved)

RossKen added 3 commits April 11, 2023 18:06

remove print statement and parametrize for spark

9ca96d3

Merge branch 'master' into 1155-sample-seed

e419bce

add performance caveat and remove itertools import

e4b0afd

lint with black

8502d8d

RossKen requested a review from ThomasHepworth April 11, 2023 17:25

ThomasHepworth reviewed Apr 11, 2023

splink/athena/athena_linker.py (resolved)

ThomasHepworth approved these changes Apr 12, 2023

RossKen merged commit 86e2b94 into master Apr 12, 2023

RossKen deleted the 1155-sample-seed branch April 19, 2023 15:06

ThomasHepworth mentioned this pull request Jun 13, 2023

Create new linker._supports_seed method #1325

Open
