Initial commit for email comparison level feature. #1277

sama-ds · 2023-06-01T09:11:14Z

Added email comparison to the comparison template libraries, along with the necessary changes in the duckdb and spark comparison template libraries, and added a rudimentary test for this feature.

STILL TO DO:

- Complete test. All gamma levels have been checked, but not full pairwise comparison. This may be needed.
- Investigate compatibility with backends other than spark and duckdb

RossKen · 2023-06-01T09:18:05Z

One additional thing worth doing for this one is adding email_comparison to the "out-of-the-box comparisons" topic guide, which is written here

RossKen · 2023-06-01T09:20:58Z

On the other backends: judging from @zslade's Regex PR, I think Athena is the only other one with regex functionality, so it can be added here.

ThomasHepworth · 2023-06-01T14:19:57Z

tests/test_comparison_template_lib.py

+    for gamma, id_pairs in size_gamma_lookup.items():
+        for left, right in id_pairs:
+            print(f"Checking IDs: {left}, {right}")
+            assert (
+                linker_output.loc[
+                    (linker_output.unique_id_l == left)
+                    & (linker_output.unique_id_r == right)
+                ]["gamma_email"].values[0]
+                == gamma
+            )


We're using this all over the place now.

Can you migrate it into its own function?

You could even add it to conftest.py so it can be reused across scripts more easily. Let me know if you need more info on conftest.py.

Yep agreed, this seems like a good idea
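
A minimal sketch of what such a shared helper could look like, for illustration only. The name `assert_gamma_levels` and its signature are hypothetical and are not the fixture that was eventually added in this PR. To stay dependency-free it takes plain row dictionaries rather than a DataFrame; with pandas, the rows could be produced with `linker_output.to_dict("records")`.

```python
def assert_gamma_levels(records, gamma_lookup, gamma_column):
    """Check that each listed ID pair was assigned the expected gamma level.

    records: iterable of dict-like rows holding unique_id_l, unique_id_r
        and the gamma column being tested.
    gamma_lookup: {expected_gamma: [(left_id, right_id), ...]}.
    gamma_column: name of the gamma column, e.g. "gamma_email".
    """
    # Index the comparison rows by ID pair so each check is a dict lookup.
    observed = {
        (row["unique_id_l"], row["unique_id_r"]): row[gamma_column]
        for row in records
    }
    for expected_gamma, id_pairs in gamma_lookup.items():
        for left, right in id_pairs:
            assert (left, right) in observed, (
                f"No comparison row for IDs {left}, {right}"
            )
            actual = observed[(left, right)]
            assert actual == expected_gamma, (
                f"IDs {left}, {right}: expected {gamma_column}="
                f"{expected_gamma}, got {actual}"
            )


# Made-up rows standing in for the linker output.
rows = [
    {"unique_id_l": 1, "unique_id_r": 2, "gamma_email": 4},
    {"unique_id_l": 1, "unique_id_r": 3, "gamma_email": 0},
]
assert_gamma_levels(rows, {4: [(1, 2)], 0: [(1, 3)]}, "gamma_email")
```

Passing the gamma column name as an argument is what would let the same check be reused for other comparisons instead of hard-coding `"gamma_email"`.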

RossKen · 2023-06-02T16:18:11Z

This seems to be working pretty well. One thing I did notice is that the chart labels for JW aren't very informative, e.g. the label does not distinguish between the whole email and the username.

I added a parameter to specify the column name manually for the exact match call function. I will add this quickly on Monday to the fuzzy level functions on this PR to improve this chart.

I'm still not entirely sure how happy I am with using JW on the full email rather than on the username, but I will think about it a bit more to see if I can come up with an alternative
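
One possible reason for caution, sketched under loud assumptions: when two addresses share a domain, a string similarity computed on the full email can look much higher than one computed on the usernames alone. The snippet uses `difflib.SequenceMatcher` from the standard library as a stand-in, which is not Jaro-Winkler, and the addresses and the `username` helper are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(left: str, right: str) -> float:
    """Stand-in similarity score in [0, 1]; this is NOT Jaro-Winkler."""
    return SequenceMatcher(None, left, right).ratio()


def username(email: str) -> str:
    """Return everything before the first '@' (hypothetical helper)."""
    return email.split("@", 1)[0]


# Made-up addresses for two different people on the same domain.
email_l = "john.smith@gmail.com"
email_r = "jane.doe@gmail.com"

full_score = similarity(email_l, email_r)
username_score = similarity(username(email_l), username(email_r))

print(f"full email: {full_score:.2f}, username only: {username_score:.2f}")
```

The shared `@gmail.com` suffix contributes to the full-email score even though the usernames have little in common, so a threshold tuned for one input may not suit the other.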

github-actions · 2023-06-02T16:21:37Z

Test: test_2_rounds_1k_duckdb

Percentage change: -25.9%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1725 | 2023-06-12 | 16:40:34 | 1.41733 | 1.38982 | (detached head) | `833d4ff` | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | `833d4ff` |

Test: test_2_rounds_1k_sqlite

Percentage change: -21.7%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1727 | 2023-06-12 | 16:40:34 | 3.34603 | 3.33635 | (detached head) | `833d4ff` | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | `833d4ff` |

RossKen · 2023-06-02T16:30:47Z

Also, there has been a slight restructure of the backend-specific comparison folders, so I have updated this branch from master to line up with it

RossKen · 2023-06-05T10:21:51Z

Have pushed the changes to make the chart labels clearer for fuzzy matching

splink/comparison_template_library.py

sama-ds · 2023-06-09T13:13:42Z

Looking into Athena linking capability: as JW is not supported, I've decided to leave this out of the PR. We can reconsider if it gets raised as an issue in future.

…irectory structure merged in from master in the prior commit. Created a pytest fixture called test_gamma_assert to prevent repetition of asserting the gamma lookup specified and the resultant comparison levels are the same. Updated the email test to reflect this.

docs/comparison_library.md

RossKen

Thanks @sama-ds for your work on this 👍 I think it is good to be merged, but I have made some changes so have a look through to double-check you are happy before hitting go 😊

Initial commit for email comparison level feature. Added email compar…

9ceb28c

…ison to comparison template libraries, as well as necessary changes in duckdb and spark comparison template libraries, and added rudimentary test for this feature.

sama-ds requested a review from RossKen June 1, 2023 09:11

Spotted some errors within name comparisons. Some redundant parameter…

9c71966

…s, but others using incorrect m_value parameters for specific levels. Have fixed these.

sama-ds linked an issue Jun 1, 2023 that may be closed by this pull request

[FEAT] Create CTL function for email addresses #1176

Closed

ThomasHepworth reviewed Jun 1, 2023

RossKen added 2 commits June 2, 2023 17:20

Merge branch 'master' into email_comparrison_template

a8695e7

lint with black

a9c0ed5

RossKen added 3 commits June 2, 2023 17:24

fix paths to comparisons

293c1e2

Merge branch 'email_comparrison_template' of github.com:moj-analytica…

6eafc1a

…l-services/splink into email_comparrison_template

lint with black

ceb8fe0

RossKen added 2 commits June 5, 2023 11:15

improve labelling for fuzzy matches

66755e0

Merge branch 'email_comparrison_template' of github.com:moj-analytica…

6aa69e2

…l-services/splink into email_comparrison_template

RossKen reviewed Jun 5, 2023

splink/comparison_template_library.py (resolved)

RossKen force-pushed the master branch from 7f280c1 to 9854cc2 Compare June 8, 2023 12:17

ThomasHepworth force-pushed the master branch from 9854cc2 to cc32743 Compare June 8, 2023 12:34

Added functionality for athena linker to run. Committing prior to fi…

e89f013

…nishing as tests need to be run via the AP.

sama-ds and others added 7 commits June 9, 2023 15:36

Adding asserting across the gamma lookup in conftest.py to be used ac…

f519c5e

…ross multiple tests. Initial commit as not yet working for spark.

As athena does not have JW functionality, these changes are redundant…

403f1fd

… and misleading.

Merge branch 'master' into email_comparrison_template

c28c06c

lint with black

0a724fe

lint with black

b2e0aad

Included new test_gamma_assert function in postcode and dob ctl's.

79c13f2

sama-ds added 2 commits June 12, 2023 11:06

lint with black

8807d7e

Added docs to topic guide for email comparison. Spotted a few mistake…

3ee6225

…d in the function documentation that were corrected.

sama-ds changed the title ~~WIP: Initial commit for email comparison level feature.~~ Initial commit for email comparison level feature. Jun 12, 2023

ThomasHepworth reviewed Jun 12, 2023

docs/comparison_library.md (outdated, resolved)

RossKen added 8 commits June 12, 2023 16:01

minor docs changes

580b876

remove athena example

b2ce885

add jaro to email and small changes

562267e

Merge branch 'master' into email_comparrison_template

be958f2

add email_comparison to README

69f4aee

Merge branch 'email_comparrison_template' of github.com:moj-analytica…

a93b600

…l-services/splink into email_comparrison_template

polish topic guide

e916830

fix imports

a000611

RossKen approved these changes Jun 12, 2023

sama-ds merged commit e89875c into master Jun 12, 2023

sama-ds deleted the email_comparrison_template branch June 12, 2023 16:59
