Forename Surname ctl #1174

RossKen · 2023-04-08T17:27:05Z

No description provided.

github-actions · 2023-04-08T17:28:12Z

Test: test_2_rounds_1k_duckdb

Percentage change: -16.9%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1638 | 2023-05-18 | 23:26:53 | 1.58896 | 1.55737 | (detached head) | `1494dfa` | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | `1494dfa` |

Test: test_2_rounds_1k_sqlite

Percentage change: -9.9%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | `c334bb9` | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | `c334bb9` |
| 1640 | 2023-05-18 | 23:26:53 | 3.84402 | 3.83713 | (detached head) | `1494dfa` | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | `1494dfa` |
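
As a sanity check on the two benchmark comparisons, the reported percentage changes appear to match the relative change in the stats_min column rather than stats_mean. The sketch below recomputes them from the values in the tables; the function name `pct_change` is illustrative and is not taken from the benchmark tooling. Note that the two runs were recorded on different CPU models, so the figures are probably best read as indicative.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# stats_min values copied from the two benchmark tables
duckdb_change = pct_change(1.87463, 1.55737)
sqlite_change = pct_change(4.25898, 3.83713)

print(f"test_2_rounds_1k_duckdb: {duckdb_change:.1f}%")
print(f"test_2_rounds_1k_sqlite: {sqlite_change:.1f}%")
```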

RossKen · 2023-05-05T15:13:19Z

Hey @aliceoleary0, I have cleaned this up now so would be good to have you take a look at it when you have time.

My main concern as it stands is that there are too many parameters for fuzzy matching on forename and surname. Currently there is the option to turn on levenshtein, jaro_winkler and jaccard for both, and I would need to add jaro and damerau-levenshtein once that has been merged. That gives potentially 10 parameters, corresponding to potentially 10 different comparison levels, and those come after all of the forename/surname combinations as well.

One thought I had was to have a single fuzzy match level for both surname and forename, where users can specify their comparator function and threshold of choice, combined by the new cll.or_ function. Would be keen to get your opinion though!

Don't worry about the failing tests, I assume I have broken something while cleaning this up so can have a dig into it next week.

aliceoleary0

Here are some thoughts. All code comments refer to comparison_template_library.py (sorry this file isn't in current commit so can't comment on it directly).

Think the null level should use _and rather than _or (line 719).
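
A minimal sketch of why the choice matters, assuming a column counts as null for a record pair when either side is missing (this is an assumption about how the null level behaves, and the helper names below are hypothetical rather than Splink functions). With OR, a pair lands in the null level as soon as one name is missing; with AND, it only lands there when neither name can be compared.

```python
def column_is_null(value_l, value_r) -> bool:
    """Assumed rule: a column is null for the pair if either side is missing."""
    return value_l is None or value_r is None


def null_level_or(rec_l: dict, rec_r: dict) -> bool:
    """Null level fires if forename OR surname cannot be compared."""
    return column_is_null(rec_l["forename"], rec_r["forename"]) or column_is_null(
        rec_l["surname"], rec_r["surname"]
    )


def null_level_and(rec_l: dict, rec_r: dict) -> bool:
    """Null level fires only if forename AND surname both cannot be compared."""
    return column_is_null(rec_l["forename"], rec_r["forename"]) and column_is_null(
        rec_l["surname"], rec_r["surname"]
    )


# Surname missing on one side, forename present on both
left = {"forename": "alice", "surname": None}
right = {"forename": "alice", "surname": "smith"}

print(null_level_or(left, right))   # pair is treated as null, forename evidence is lost
print(null_level_and(left, right))  # pair falls through to the later levels
```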

Also think it would be useful to include checks for when user inputs multiple fuzzy string comparison operations (as per DateComparisonBase line 208), otherwise I guess order (in terms of permissiveness) matters and this might be complicated given that different thresholds can be given for each fuzzy string operation.

Re. number of parameters - I agree that it might be overkill (given that these are supposed to be out-of-the-box templates) to have different types of fuzzy matches happening for surname and forename (i.e. might make sense to have one fuzzy match method and threshold for both).

However, the second point about whether these should be combined into an _or statement seems to me to be a separate question as "fuzzy surname" followed by "fuzzy forename" isn't the same as "fuzzy surname or fuzzy forename". Although I'm not sure in practice how much difference this would make to the model.

So, to reduce parameters you could have a single e.g. levenshtein_thresholds parameter (instead of one for surname and forename). Then it is a question whether 1) you apply this parameter sequentially to surname and forename as two separate comparison levels or 2) combine them in an _or statement. I think the first option is more similar to what we converged on in the link&learn?
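
The difference between the two options can be sketched in plain Python. This is an illustration only: the level labels, function names and the threshold of 2 are assumptions, and real comparison levels are evaluated as SQL rather than Python. With separate levels the first matching level wins, so surname and forename fuzzy matches get their own match weights; with a single OR level they are pooled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sequential_level(rec_l: dict, rec_r: dict, threshold: int = 2) -> str:
    """Option 1: two separate levels, checked in order."""
    if levenshtein(rec_l["surname"], rec_r["surname"]) <= threshold:
        return "fuzzy surname"
    if levenshtein(rec_l["forename"], rec_r["forename"]) <= threshold:
        return "fuzzy forename"
    return "else"


def or_level(rec_l: dict, rec_r: dict, threshold: int = 2) -> str:
    """Option 2: one level that fires if either name is a fuzzy match."""
    surname_ok = levenshtein(rec_l["surname"], rec_r["surname"]) <= threshold
    forename_ok = levenshtein(rec_l["forename"], rec_r["forename"]) <= threshold
    return "fuzzy surname or forename" if surname_ok or forename_ok else "else"


base = {"forename": "alice", "surname": "smith"}
surname_typo = {"forename": "zachary", "surname": "smyth"}
forename_typo = {"forename": "alise", "surname": "jones"}

for other in (surname_typo, forename_typo):
    print(sequential_level(base, other), "|", or_level(base, other))
```

Under option 1 the two pairs fall into different levels; under option 2 both fall into the same one, which is why the two designs are not interchangeable even if the practical effect on the model turns out to be small.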

RossKen · 2023-05-12T11:45:26Z

Thanks for having a look at this @aliceoleary0!

Good point on the null_level - I have now changed that.

I agree that having one set of fuzzy-matching parameters giving separate forename and surname levels makes sense and reduces complexity e.g.:

surname levenshtein <= 2
forename levenshtein <= 2

I think I got caught up in allowing the most flexibility possible within the function, but I think it makes sense to keep this simple, and users can construct their own comparison from comparison levels if they want something more bespoke.

My point above on combining fuzzy levels was more about whether all fuzzy matches should be included in one level for forename and one level for surname, but I realise my explanation was not the clearest. E.g.

surname (levenshtein <= 2 OR jaro_winkler > 0.9)
forename (levenshtein <= 2 OR jaro_winkler > 0.9)

This would not simplify/reduce the parameters that the users need to provide (other than they would only specify one threshold for each comparator). The main concern I have once we get down to these fuzzy levels on individual names is not having enough examples to train on, so combining fuzzy matches for e.g. forename as above would provide more chance for records to get down to the fuzzy levels.
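
For concreteness, a combined level of this kind boils down to a single SQL condition per name column. The helper below is a hypothetical sketch, not the Splink implementation: it assumes the `_l` / `_r` column suffix convention and the DuckDB function names `levenshtein` and `jaro_winkler_similarity`, and simply builds the condition string.

```python
def fuzzy_or_condition(
    col: str,
    levenshtein_threshold: int = 2,
    jaro_winkler_threshold: float = 0.9,
) -> str:
    """Build one SQL condition that ORs two fuzzy comparators on a column.

    Hypothetical helper: assumes DuckDB-style function names and
    left/right columns named <col>_l and <col>_r.
    """
    lev = f'levenshtein("{col}_l", "{col}_r") <= {levenshtein_threshold}'
    jw = f'jaro_winkler_similarity("{col}_l", "{col}_r") > {jaro_winkler_threshold}'
    return f"({lev}) OR ({jw})"


for name_col in ("surname", "forename"):
    print(fuzzy_or_condition(name_col))
```

Because a pair satisfying either comparator reaches the level, such a condition would be expected to capture at least as many record pairs as either comparator alone, which is the training-data argument made above.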

RossKen · 2023-05-12T16:08:09Z

@aliceoleary0 I have updated the function and I think I am happy with how it functions now, but would be keen to hear your thoughts.
I have also written up some documentation in the out-of-the-box comparison and feature engineering topic guides so if you wouldn't mind having a look at those too that would be ace.

Final thing to fix up are the tests, which I will do early next week then we should be good to go I reckon. I will also need to update splink_demos as changes on this branch are causing it to break.

aliceoleary0 · 2023-05-16T09:05:54Z

@RossKen thanks for this - will review later today.

docs/topic_guides/comparison_templates.ipynb

aliceoleary0 · 2023-05-16T10:43:23Z

docs/topic_guides/feature_engineering.md

In the Full name example: in the example table, do you want an empty surname element to generate a NaN in the full_name column (first row)? It makes more sense to me if full_name just includes the forename in this case.

Going to leave as is as individual surname match will catch this
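
The two behaviours being discussed can be sketched as follows, assuming full_name is built by joining forename and surname with a space (an assumption about the docs example; both function names are hypothetical). The first variant propagates missing values, which is the behaviour being kept; the second falls back to whichever part is present.

```python
from typing import Optional


def full_name_null_propagating(
    forename: Optional[str], surname: Optional[str]
) -> Optional[str]:
    """Missing forename or surname makes the whole full_name missing."""
    if forename is None or surname is None:
        return None
    return f"{forename} {surname}"


def full_name_with_fallback(
    forename: Optional[str], surname: Optional[str]
) -> Optional[str]:
    """Use whichever name parts are present; missing only if both are."""
    parts = [part for part in (forename, surname) if part is not None]
    return " ".join(parts) if parts else None


print(full_name_null_propagating("alice", None))
print(full_name_with_fallback("alice", None))
```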

splink/comparison_template_library.py

aliceoleary0

Thanks for clarifying what you meant re. combining fuzzy levels - I get your point on that now.

I guess we'd want to re-train the models in a few of our existing pipelines (ideally with a range of data quality in names columns) and see what the match weights look like / how good the sampling is when we run with e.g. levenshtein and jaro_winkler (or whatever we want the defaults to be in this template): 1) as separate levels vs 2) as a single cll.or_ general fuzzy level?

Basically, if we are suggesting a combination of fuzzy match levels etc. that differs from what is widely implemented in our existing pipelines, I guess we'd want to test it before suggesting it as a default?

See also this thread for discussion on whether the OR approach might introduce extra computation (https://mojdt.slack.com/archives/C02TCQLLJAX/p1679401803941459). I'm unsure re what this means in terms of how the composition levels have been implemented.

RossKen · 2023-05-16T13:54:07Z

Going to leave the fuzzy matches as they are for simplicity. Getting into OR statements feels like it will get more confusing for users. This way it will be consistent with the other ctl functions.

Thanks for reviewing @aliceoleary0!

RossKen added 10 commits March 27, 2023 15:15

WIP

5c20e3b

WIP

91a6b91

progress

b138848

Start restructure

324dd2e

WIP

4cf6731

WIP

2689f6c

progress

ff8634f

commit for rebase

2dcb260

updates

e250507

Merge branch 'full_name_ctl' of github.com:moj-analytical-services/sp…

52c5dee

…link into full_name_ctl

RossKen marked this pull request as draft April 8, 2023 17:27

RossKen added 11 commits April 8, 2023 18:36

Merge branch 'master' into full_name_ctl

eea32e8

minimal function, description function

c614d75

add comparison at threshold and desc function

a1caa57

fix utils fn & add initial test

0ba8315

update docs

2a6bfb6

Delete feature_engineering.md

a866cd0

lint with black

063b391

Merge branch 'master' into full_name_ctl

1f56cfa

lint with black

33dc727

clean up tf adjustments

77d61d9

Merge branch 'full_name_ctl' of github.com:moj-analytical-services/sp…

c28df26

…link into full_name_ctl

RossKen requested a review from aliceoleary0 May 5, 2023 15:06

RossKen marked this pull request as ready for review May 5, 2023 15:06

aliceoleary0 reviewed May 10, 2023

RossKen added 2 commits May 12, 2023 14:57

change null, simplify fuzzies and add labels

91e43a2

Update examples and add docs

cfb1855

RossKen added 3 commits May 12, 2023 16:07

Improve spacing

0583ac7

update fe

74719c0

Merge branch 'master' into full_name_ctl

ec67c1d

RossKen added 4 commits May 14, 2023 20:24

fix duckdb test

26b6836

lint with black

92c7b38

fix spark test fail

38e1923

Merge branch 'full_name_ctl' of github.com:moj-analytical-services/sp…

3eef395

…link into full_name_ctl

RossKen requested a review from aliceoleary0 May 15, 2023 22:10

aliceoleary0 reviewed May 16, 2023

docs/topic_guides/comparison_templates.ipynb

aliceoleary0 reviewed May 16, 2023

splink/comparison_template_library.py

aliceoleary0 reviewed May 16, 2023

PR Review updates

04161d9

RossKen added 8 commits May 18, 2023 23:07

Merge branch 'master' into full_name_ctl

61095b1

Merge branch 'master' into full_name_ctl

8d9f3db

update athena ctl

1730fb9

lint with black

83724cf

typo

d1a0064

conflict

f29021d

small docs changes

2c436c6

docstrings

9f8705a

RossKen merged commit 7261b5e into master May 18, 2023

RossKen deleted the full_name_ctl branch May 18, 2023 23:30
