Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit for email comparison level feature. #1277

Merged
merged 27 commits into from
Jun 12, 2023

Conversation

sama-ds
Copy link
Contributor

@sama-ds sama-ds commented Jun 1, 2023

Added email comparison to comparison template libraries, as well as necessary changes in duckdb and spark comparison template libraries, and added rudimentary test for this feature.

STILL TO DO:

  • Complete test. All gamma levels have been checked, but not full pairwise comparison. This may be needed.
  • Investigate compatibility with backends other than spark and duckdb

…ison to comparison template libraries, as well as necessary changes in duckdb and spark comparison template libraries, and added rurudimentary test for this feature.
@sama-ds sama-ds requested a review from RossKen June 1, 2023 09:11
@RossKen
Copy link
Contributor

RossKen commented Jun 1, 2023

One additional thing worth doing for this one is adding email_comparison to the "out-of-the-box comparisons" topic guide which are written here

@RossKen
Copy link
Contributor

RossKen commented Jun 1, 2023

On the other backends, from @zslade's Regex PR I think it is only Athena that has regex functionality so it can be added here.

…s, but others using incorrect m_value parameters for specific levels. Have fixed these.
@sama-ds sama-ds linked an issue Jun 1, 2023 that may be closed by this pull request
Comment on lines 616 to 625
for gamma, id_pairs in size_gamma_lookup.items():
for left, right in id_pairs:
print(f"Checking IDs: {left}, {right}")
assert (
linker_output.loc[
(linker_output.unique_id_l == left)
& (linker_output.unique_id_r == right)
]["gamma_email"].values[0]
== gamma
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using this all over the place now.

Can you migrate it into its own function?

You could even add it to confest.py so it can be reused across scripts more easily. Lmk if you need more info on conftest.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep agreed, this seems like a good idea

@RossKen
Copy link
Contributor

RossKen commented Jun 2, 2023

This seems to be working pretty well. One thing I did notice is that the chart labels for JW aren't super informative e.g.
image
does not distinguish between the whole email vs username.

I added a parameter to specify the column name manually for the exact match call function. I will add this quickly on Monday to the fuzzy level functions on this PR to improve this chart.

I'm still not entirely sure how happy I am with the JW on full email than on username - but I will think about it a bit more to see if I can come up with an alternative

@github-actions
Copy link
Contributor

github-actions bot commented Jun 2, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -25.9%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1725 2023-06-12 16:40:34 1.41733 1.38982 (detached head) 833d4ff Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 833d4ff

Test: test_2_rounds_1k_sqlite

Percentage change: -21.7%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1727 2023-06-12 16:40:34 3.34603 3.33635 (detached head) 833d4ff Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 833d4ff

Click here for vega lite time series charts

@RossKen
Copy link
Contributor

RossKen commented Jun 2, 2023

Also, there has been a slight restructure to the backend specific comparison folders do I have updated this branch to master to line up with it

@RossKen
Copy link
Contributor

RossKen commented Jun 5, 2023

Have pushed the changes to make the chart labels clearer for fuzzy matching
image

@sama-ds
Copy link
Contributor Author

sama-ds commented Jun 9, 2023

Looking into Athena linking capability- as JW is not supported I've taken the decision to leave this out of the PR. We can re-consider if it gets raised as an issue in future.

sama-ds and others added 7 commits June 9, 2023 15:36
…ross multiple tests. Initial commit as not yet working for spark.
…irectory structure merged in from master in the prior commit. Created a pytest fixture called test_gamma_assert to prevent repetition of asserting the gamma lookup specified and the resultant comparison levels are the same. Updated the email test to reflect this.
@sama-ds sama-ds changed the title WIP: Initial commit for email comparison level feature. Initial commit for email comparison level feature. Jun 12, 2023
Copy link
Contributor

@RossKen RossKen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sama-ds for your work on this 👍 I think it is good to be merged, but I have made some changes so have a look through to double-check you are happy before hitting go 😊

@sama-ds sama-ds merged commit e89875c into master Jun 12, 2023
@sama-ds sama-ds deleted the email_comparrison_template branch June 12, 2023 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] Create CTL function for email addresses
3 participants