Forename Surname ctl #1174
Test: test_2_rounds_1k_duckdb, Percentage change: -16.9%
Test: test_2_rounds_1k_sqlite, Percentage change: -9.9%
…link into full_name_ctl
Hey @aliceoleary0, I have cleaned this up now, so it would be good to have you take a look at it when you have time. My main concern as it stands is that there are too many parameters for fuzzy match for One thought I had was to have a single fuzzy match level for both Don't worry about the failing tests; I assume I have broken something while cleaning this up, so I can have a dig into it next week.
Here are some thoughts. All code comments refer to comparison_template_library.py (sorry this file isn't in current commit so can't comment on it directly).
Think the null level should use _and rather than _or (line 719).
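To illustrate the `_and` vs `_or` point, here is a minimal sketch that uses plain SQL in SQLite rather than Splink itself. The column names (`forename_l`, `surname_r`, etc.) and both conditions are assumptions made for illustration; this is not the code at line 719.

```python
import sqlite3

# Hypothetical per-column null conditions (column names are assumptions).
FORENAME_NULL = "(forename_l IS NULL OR forename_r IS NULL)"
SURNAME_NULL = "(surname_l IS NULL OR surname_r IS NULL)"

NULL_WITH_OR = f"{FORENAME_NULL} OR {SURNAME_NULL}"
NULL_WITH_AND = f"{FORENAME_NULL} AND {SURNAME_NULL}"


def is_null_level(condition, pair):
    """Evaluate a null-level SQL condition against a single record pair."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE pairs (forename_l, forename_r, surname_l, surname_r)"
    )
    con.execute("INSERT INTO pairs VALUES (?, ?, ?, ?)", pair)
    (result,) = con.execute(f"SELECT {condition} FROM pairs").fetchone()
    con.close()
    return bool(result)


# Surname is missing on one side, but forename is present on both sides.
pair = ("john", "jon", None, "smith")
print(is_null_level(NULL_WITH_OR, pair))   # OR: the whole pair is treated as null
print(is_null_level(NULL_WITH_AND, pair))  # AND: the pair falls through to later levels
```

With `OR`, a pair that still has a usable forename on both sides is swallowed by the null level; with `AND`, it stays available for the forename levels further down.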
Also think it would be useful to include checks for when user inputs multiple fuzzy string comparison operations (as per DateComparisonBase line 208), otherwise I guess order (in terms of permissiveness) matters and this might be complicated given that different thresholds can be given for each fuzzy string operation.
Re. number of parameters - I agree that it might be overkill (given that these are supposed to be out-of-the-box templates) to have different types of fuzzy matches happening for surname and forename (i.e. might make sense to have one fuzzy match method and threshold for both).
However, the second point about whether these should be combined into an _or statement seems to me to be a separate question as "fuzzy surname" followed by "fuzzy forename" isn't the same as "fuzzy surname or fuzzy forename". Although I'm not sure in practice how much difference this would make to the model.
So, to reduce parameters you could have a single parameter, e.g. levenshtein_thresholds (instead of one each for surname and forename). Then it is a question of whether you 1) apply this parameter sequentially to surname and forename as two separate comparison levels or 2) combine them in an _or statement. I think the first option is more similar to what we converged on in the link&learn?
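The two options can be sketched as plain SQL-condition builders. This is only an illustration under assumptions: the helper names are hypothetical, the dictionaries loosely mirror the `sql_condition` / `label_for_charts` shape of a comparison level, the SQL assumes a backend with a `levenshtein` function (as DuckDB has), and the ordering of levels in option 1 (surname then forename, per threshold) is a guess at what "sequentially" would mean.

```python
def fuzzy_condition(col, threshold):
    """SQL for 'the two values of `col` are within `threshold` edits'."""
    return f"levenshtein({col}_l, {col}_r) <= {threshold}"


def sequential_levels(levenshtein_thresholds):
    """Option 1: separate surname and forename levels for each threshold."""
    levels = []
    for threshold in levenshtein_thresholds:
        for col in ("surname", "forename"):
            levels.append(
                {
                    "sql_condition": fuzzy_condition(col, threshold),
                    "label_for_charts": f"Levenshtein {col} <= {threshold}",
                }
            )
    return levels


def combined_levels(levenshtein_thresholds):
    """Option 2: one level per threshold, fuzzy surname OR fuzzy forename."""
    levels = []
    for threshold in levenshtein_thresholds:
        surname = fuzzy_condition("surname", threshold)
        forename = fuzzy_condition("forename", threshold)
        levels.append(
            {
                "sql_condition": f"({surname}) OR ({forename})",
                "label_for_charts": f"Levenshtein surname or forename <= {threshold}",
            }
        )
    return levels


for level in sequential_levels([1, 2]):
    print(level["sql_condition"])
for level in combined_levels([1, 2]):
    print(level["sql_condition"])
```

Both builders take the same single parameter; the difference is that option 1 yields two levels per threshold, each with its own match weight, while option 2 yields one level per threshold that pools both kinds of pair.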
Thanks for having a look at this @aliceoleary0! Good point on the null_level - I have now changed that. I agree that having one set of fuzzy-matching parameters giving separate forename and surname levels makes sense and reduces complexity e.g.:
My point above on combining fuzzy levels was more about whether all fuzzy matches should be included in one level for forename and one level for surname, but I realise my explanation was not the clearest. E.g.
This would not simplify or reduce the parameters that users need to provide (other than that they would only specify one threshold for each comparator). The main concern I have once we get down to these fuzzy levels on individual names is not having enough examples to train on, so combining fuzzy matches for e.g. forename as above would give records more chance of falling into the fuzzy levels.
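As a rough sketch of what one combined fuzzy level per name could look like, the helper below ORs together whichever comparators have a threshold set. The function name and parameters are hypothetical (not the template's actual signature), and the SQL assumes DuckDB-style `levenshtein` and `jaro_winkler_similarity` functions.

```python
def combined_fuzzy_level(col, levenshtein_threshold=None, jaro_winkler_threshold=None):
    """Build one fuzzy level for `col` from every comparator that has a threshold."""
    parts = []
    if levenshtein_threshold is not None:
        parts.append(f"levenshtein({col}_l, {col}_r) <= {levenshtein_threshold}")
    if jaro_winkler_threshold is not None:
        parts.append(
            f"jaro_winkler_similarity({col}_l, {col}_r) >= {jaro_winkler_threshold}"
        )
    if not parts:
        raise ValueError("at least one threshold is required")
    return {
        "sql_condition": " OR ".join(f"({part})" for part in parts),
        "label_for_charts": f"Fuzzy {col}",
    }


# One pooled fuzzy level for forename and one for surname.
for col in ("forename", "surname"):
    print(combined_fuzzy_level(col, 2, 0.9)["sql_condition"])
```

Because a pair only needs to satisfy one of the comparators, each pooled level should receive at least as many record pairs as any single-comparator level in the same position.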
@aliceoleary0 I have updated the function and I think I am happy with how it works now, but would be keen to hear your thoughts. The final thing to fix is the tests, which I will do early next week; then we should be good to go, I reckon. I will also need to update splink_demos, as changes on this branch are causing it to break.
@RossKen thanks for this - will review later today.
In the full name example: in the example table, do you want an empty surname element to generate a NaN in the full_name column (first row)? It makes more sense to me if full_name just includes the forename in this case.
Going to leave this as is, as the individual surname match will catch this.
Thanks for clarifying what you meant re. combining fuzzy levels. I get your point on that now.
I guess we'd want to re-train the models in a few of our existing pipelines (ideally with a range of data quality in the names columns) and see what the match weights look like / how good the sampling is when we run with e.g. levenshtein and jaro_winkler (or whatever we want the defaults to be in this template): 1) as separate levels vs 2) as a single cll.or_ general fuzzy level?
Basically, if we are suggesting a different combination of fuzzy match levels etc. from what is implemented widely in our existing pipelines, I guess we'd want to test it before suggesting it as a default?
See also this thread for discussion on whether the OR approach might introduce extra computation (https://mojdt.slack.com/archives/C02TCQLLJAX/p1679401803941459). I'm unsure re what this means in terms of how the composition levels have been implemented.
Going to leave the fuzzy matches as they are for simplicity. Getting into Thanks for reviewing @aliceoleary0!