Postcode comparison template #1230

Merged
merged 21 commits into from
May 18, 2023

@zslade zslade commented May 10, 2023

Addresses issue #215

Comparison template for postcode column. The default arguments will give a comparison with levels:
- Exact match on full postcode
- Exact match on sector
- Exact match on district
- Exact match on area
- All other comparisons

with an optional 'distance in km' comparison level
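
To make the default levels concrete, here is a minimal sketch of how two UK postcodes could be reduced to the components named above and compared from most to least specific. It is not the template's actual implementation: `postcode_components`, `first_matching_level` and the regular expression are hypothetical names and assumptions introduced only for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR):
# area = 1-2 letters, district = area + digit + optional digit/letter,
# sector = district + first digit of the inward code, unit = last 2 letters.
POSTCODE_RE = re.compile(
    r"^(?P<area>[A-Z]{1,2})(?P<district_suffix>[0-9][0-9A-Z]?)"
    r"\s*(?P<sector_digit>[0-9])(?P<unit>[A-Z]{2})$"
)


def postcode_components(postcode: str) -> dict:
    """Split a postcode into area, district, sector and full forms."""
    match = POSTCODE_RE.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"Not a recognisable UK postcode: {postcode!r}")
    area = match["area"]
    district = area + match["district_suffix"]
    sector = f"{district} {match['sector_digit']}"
    full = f"{sector}{match['unit']}"
    return {"area": area, "district": district, "sector": sector, "full": full}


def first_matching_level(left: str, right: str) -> str:
    """Return the most specific level at which the two postcodes agree."""
    a, b = postcode_components(left), postcode_components(right)
    for level in ("full", "sector", "district", "area"):
        if a[level] == b[level]:
            return level
    return "else"  # corresponds to "All other comparisons"
```

For example, `first_matching_level("SW1A 1AA", "SW1A 2AA")` falls through the full and sector checks and stops at the district level.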

@zslade zslade requested a review from RossKen May 10, 2023 18:04

github-actions bot commented May 10, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -28.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1635 | 2023-05-18 | 15:27:31 | 1.37419 | 1.3496 | (detached head) | 9039d3a | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | 9039d3a |

Test: test_2_rounds_1k_sqlite

Percentage change: -24.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1637 | 2023-05-18 | 15:27:31 | 3.23967 | 3.23781 | (detached head) | 9039d3a | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | 9039d3a |



RossKen commented May 10, 2023

@zslade this looks good at first glance (I'm on my phone at the moment, so I'll look at it properly tomorrow).

One thing I think should be added is the ability to add multiple km distance levels.

You can see an example of how to deal with multiple levels at

levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)

So you will need to use ensure_is_iterable then distance_threshold_comparison_levels to get this working for the levels themselves.

Plus add to the comparison description similar to

if len(levenshtein_thresholds) > 0:

With distance_threshold_description
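
As a rough illustration of the pattern suggested above (normalise the argument to a list, then generate one level and one description fragment per threshold), here is a standalone sketch. `as_list`, `km_level_labels` and `describe_comparison` are hypothetical stand-ins written for this example; they are not the signatures of `ensure_is_iterable`, `distance_threshold_comparison_levels` or `distance_threshold_description`, which are not shown in this thread.

```python
from collections.abc import Iterable


def as_list(value):
    """Hypothetical normalisation step: accept a scalar, an iterable or None."""
    if value is None:
        return []
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        return list(value)
    return [value]


def km_level_labels(km_thresholds):
    """One label per km threshold, tightest threshold first."""
    thresholds = sorted(as_list(km_thresholds))
    return [f"Distance within {km} km" for km in thresholds]


def describe_comparison(km_thresholds):
    """Append the distance levels to the description only when any are given."""
    labels = km_level_labels(km_thresholds)
    description = "Exact match on full postcode, sector, district and area"
    if len(labels) > 0:
        description += "; " + "; ".join(labels)
    return description + "; all other comparisons"
```

With this shape, a caller can pass a single value such as `5` or a list such as `[1, 10]` and get one level per threshold either way.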


RossKen commented May 10, 2023

Also, it would be good to update the feature engineering topic guide with this function for the postcode section


RossKen commented May 10, 2023

And there are postcode columns in some of the Splink demos, which should be updated to use this function instead


RossKen commented May 14, 2023

Hey @zslade, I was just having a skim through the feature engineering docs. Apologies, I should have been clearer on what I was thinking here. The focus of that topic guide is on how adding extra columns can improve a Splink model, so the postcode section was mainly looking at how adding lat/long could provide additional levels to match on.

So here, all I was thinking was to replace the initial levenshtein_at_thresholds with postcode_comparison, then change some of the narrative around it to talk about the benefits of adding distance_in_km levels on top of the purely postcode-based solution, i.e. being able to compare places that are close but in different postcode regions. For example, N London postcodes and SW London postcodes are right beside each other, but the postcodes on their own wouldn't tell you that.

Apologies for any confusion - that's my bad!
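
To illustrate the point about lat/long, here is a minimal sketch of a great-circle (haversine) distance check against a set of km thresholds. The function names, the Earth radius constant and the coordinates in the usage note are assumptions made for this example; they are placeholders rather than real postcode centroids, and this is not how the distance_in_km level is implemented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def first_distance_level(distance_km, km_thresholds):
    """Return the tightest threshold the distance falls within, or None."""
    for km in sorted(km_thresholds):
        if distance_km <= km:
            return km
    return None
```

Two records whose postcodes share no area prefix would fall straight to the "all other comparisons" level on postcode alone, but if their placeholder coordinates were, say, (51.50, -0.10) and (51.51, -0.10), `haversine_km` gives roughly 1.1 km, so a 5 km level would still catch them.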


@RossKen RossKen left a comment


Great work! I just pushed some small changes, and now happy for you to merge! 🎉

@zslade zslade merged commit 879eb5e into master May 18, 2023
@RossKen RossKen deleted the postcode_wrapper_template branch May 18, 2023 16:07