Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forename Surname ctl #1174

Merged
merged 39 commits into from
May 18, 2023
Merged

Forename Surname ctl #1174

merged 39 commits into from
May 18, 2023

Conversation

RossKen
Copy link
Contributor

@RossKen RossKen commented Apr 8, 2023

No description provided.

@RossKen RossKen marked this pull request as draft April 8, 2023 17:27
@github-actions
Copy link
Contributor

github-actions bot commented Apr 8, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -16.9%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1638 2023-05-18 23:26:53 1.58896 1.55737 (detached head) 1494dfa Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz 2.0952 GHz 1494dfa

Test: test_2_rounds_1k_sqlite

Percentage change: -9.9%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1640 2023-05-18 23:26:53 3.84402 3.83713 (detached head) 1494dfa Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz 2.0952 GHz 1494dfa

Click here for vega lite time series charts

@RossKen RossKen requested a review from aliceoleary0 May 5, 2023 15:06
@RossKen RossKen marked this pull request as ready for review May 5, 2023 15:06
@RossKen
Copy link
Contributor Author

RossKen commented May 5, 2023

Hey @aliceoleary0, I have cleaned this up now so would be good to have you take a look at it when you have time.

My main concern as it stands is that there are too many parameters for fuzzy match for forename and surname. Currenly there is the option to turn on levenshtein, jaro_winkler, jaccard for both. Plus I would need to add jaro and damerau-levenshtein once it has been merged. So, having potentially 10 parameters, corresponding to potentially 10 different comparison levels, and that's after all of the forename/surname combinations as well.

One thought I has was to have a single fuzzy match level for both surname and forename where users can specify their comparator function and threshold of choice combined by the new cll.or_ function. Would be keen to get your opinion though!

Don't worry about the failing tests, I assume I have broken something while cleaning this up so can have a dig into it next week.

Copy link
Contributor

@aliceoleary0 aliceoleary0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some thoughts. All code comments refer to comparison_template_library.py (sorry this file isn't in current commit so can't comment on it directly).

Think the null level should use _and rather than _or (line 719).

Also think it would be useful to include checks for when user inputs multiple fuzzy string comparison operations (as per DateComparisonBase line 208), otherwise I guess order (in terms of permissiveness) matters and this might be complicated given that different thresholds can be given for each fuzzy string operation.

Re. number of parameters - I agree that it might be overkill (given that these are supposed to be out-of-the-box templates) to have different types of fuzzy matches happening for surname and forename (i.e. might make sense to have one fuzzy match method and threshold for both).

However, the second point about whether these should be combined into an _or statement seems to me to be a separate question as "fuzzy surname" followed by "fuzzy forename" isn't the same as "fuzzy surname or fuzzy forename". Although I'm not sure in practice how much difference this would make to the model.

So, to reduce parameters you could have a single e.g. levenshtein_thresholds parameter (instead of one for surname and forename). Then it is a question whether 1) you apply this parameter sequentially to surname and forename as two separate comparison levels or 2) combine then in an _or statement. I think the first option is more similar to what we converged on in the link&learn?

@RossKen
Copy link
Contributor Author

RossKen commented May 12, 2023

Thanks for having a look at this @aliceoleary0!

Good point on the null_level - I have now changed that.

I agree that having one set of fuzzy-matching parameters giving separate forename and surname levels makes sense and reduces complexity e.g.:

  • surname levenshtein <= 2
  • forename levenshtein <-2
    I think I got caught up in allowing the most flexibility possible within the function, but I think it makes sense to keep this simple and users can construct their oven comparison from comparison levels if they want something more bespoke.

My point above on combining fuzzy levels was more about whether all fuzzy matches should be included in one level for forename and one level for surname, but I realise my explanation was not the clearest. E.g.

  • surname (levenshtein <= 2 OR jaro_winkler>0.9)
  • forename (levenshtein <=2 OR jaro_winkler>0.9)

This would not simplify/reduce the parameter that the users need to provide (other than they would only specify one threshold for each comparator). The main concern I have once we get down to these fuzzy levels on individual names is not having enough examples to train on, so combining fuzzy matches for e.g. forename as above would provide more chance for records to get down to the fuzzy levels.

@RossKen
Copy link
Contributor Author

RossKen commented May 12, 2023

@aliceoleary0 I have updated the function and I think I am happy with how it functions now, but would be keen to hear your thoughts.
I have also written up some documentation in the out-of-the-box comparison and feature engineering topic guides so if you wouldn't mind having a look at those too that would be ace.

Final thing to fix up are the tests, which I will do early next week then we should be good to go I reckon. I will also need to update splink_demos as changes on this branch are causing it to break.

@RossKen RossKen requested a review from aliceoleary0 May 15, 2023 22:10
@aliceoleary0
Copy link
Contributor

@RossKen thanks for this - will review later today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Full name example: in the example table do you want an empty surname element to generate a Nan in the full_name column (first row)? Makes more sense to me if the full_name just includes forename in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to leave as is as individual surname match will catch this

Copy link
Contributor

@aliceoleary0 aliceoleary0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying what you meant re. combining fuzzy levels- I get your point on that now.

I guess we'd want to re-train the models in a few of our existing pipelines (ideally with a range of data quality in names columns) and see what the match weights look like / how good the sampling is when we run with e.g levenshtein and jaro_winkler (or whatever we want the defaults to be in this template): 1) As separate levels vs 2) As a single cll.or_ general fuzzy level?

Basically if we are suggesting a new combination of fuzzy match levels etc than were implemented widely in our existing pipelines I guess we'd want to test it before suggesting it as a default?

See also this thread for discussion on whether the OR approach might introduce extra computation (https://mojdt.slack.com/archives/C02TCQLLJAX/p1679401803941459). I'm unsure re what this means in terms of how the composition levels have been implemented.

@RossKen
Copy link
Contributor Author

RossKen commented May 16, 2023

Going to leave the fuzzy matches as they are for simplicity. Getting into OR statements feels like it will get more confusing for users. This way it will be consistent with the other ctl functions.

Thanks for reviewing @aliceoleary0!

@RossKen RossKen merged commit 7261b5e into master May 18, 2023
@RossKen RossKen deleted the full_name_ctl branch May 18, 2023 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants