Add comparison template library functions for simple name column #1125

Merged
merged 12 commits into from
Mar 22, 2023

@RossKen RossKen commented Mar 17, 2023

Adds a comparison template library function, name_comparison, which gives a simple interface for comparing individual name columns. I plan to create a combined first name and surname comparison as well, but that is a bit more fiddly, so I will do it in a separate PR.

With the default values, name_comparison gives the same output as jaro_winkler_at_thresholds([0.95, 0.88]), but its parameters give more flexibility: dmetaphone columns and other string distance metrics can be used. It is being released only for Spark and DuckDB at this stage. We need to think about a more general solution for the likes of AthenaLinker, as not all the comparator functions are available there.
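To illustrate what "the same output as jaro_winkler_at_thresholds([0.95, 0.88])" means in practice, here is a minimal pure-Python sketch of thresholded comparison levels. It is not Splink's implementation (Splink generates SQL for the backend). The function names `jaro`, `jaro_winkler` and `name_level`, the level labels, and the null handling are assumptions made for this illustration only.

```python
from typing import Optional, Sequence


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def name_level(
    left: Optional[str],
    right: Optional[str],
    thresholds: Sequence[float] = (0.95, 0.88),
) -> str:
    """Assign a pair of names to the first comparison level it satisfies."""
    if left is None or right is None:
        return "null"
    if left == right:
        return "exact match"
    score = jaro_winkler(left, right)
    # Thresholds are checked from strictest to loosest.
    for threshold in sorted(thresholds, reverse=True):
        if score >= threshold:
            return f"jaro_winkler >= {threshold}"
    return "all other comparisons"


if __name__ == "__main__":
    for pair in [("john", "john"), ("martha", "marhta"), ("john", "jon"), ("john", "mary")]:
        print(pair, "->", name_level(*pair))
```

Each pair falls into the first level whose condition it meets, so a looser threshold only ever catches pairs that the stricter levels have already rejected.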

@RossKen RossKen marked this pull request as draft March 17, 2023 12:56

github-actions bot commented Mar 17, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -15.5%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1494 | 2023-03-22 | 17:55:37 | 1.60626 | 1.5844 | (detached head) | 5f353ce | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | 5f353ce |

Test: test_2_rounds_1k_sqlite

Percentage change: -18.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1496 | 2023-03-22 | 17:55:37 | 3.54648 | 3.49298 | (detached head) | 5f353ce | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | 5f353ce |
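The reported percentage changes appear to match the relative change in stats_min (not stats_mean) between the two rows of each test. That is an inference from the numbers, not something the bot states. A small sketch of the arithmetic, using a hypothetical helper `percentage_change`:

```python
def percentage_change(baseline: float, current: float) -> float:
    """Relative change from baseline to current, as a percentage."""
    return (current - baseline) / baseline * 100.0


# stats_min values copied from the benchmark rows above (assumed to be the basis).
duckdb_change = percentage_change(1.87463, 1.5844)
sqlite_change = percentage_change(4.25898, 3.49298)

print(f"test_2_rounds_1k_duckdb: {duckdb_change:.1f}%")  # -15.5%
print(f"test_2_rounds_1k_sqlite: {sqlite_change:.1f}%")  # -18.0%
```

Note that the two runs in each test were recorded on different CPU models, so the change may not be attributable to the code alone.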


@RossKen RossKen changed the title from "Add comparison template library functions for names" to "Add comparison template library functions for simple name column" Mar 21, 2023
@RossKen RossKen marked this pull request as ready for review March 21, 2023 16:57

RossKen commented Mar 21, 2023

Lint is failing due to an issue that is fixed in #1131.


@ADBond ADBond left a comment

Looks good. Just a few minor points, but happy for you to adjust as you see fit and then merge.

@RossKen RossKen merged commit 8852aca into master Mar 22, 2023
@RossKen RossKen deleted the names_ctl branch April 19, 2023 15:06
2 participants