Any plans to include data standardisation module? #986

amgfernandes · 2023-01-20T08:26:19Z

First of all, thank you very much for the great work and for providing Splink as open source. It's a great tool and I have used it regularly for several projects.

Did you consider including a preprocessing module so datasets can be standardised (same column names, and consistent formatting like lowercased, punctuation cleaned up)?

Something similar to this: https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html
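As a rough illustration of the kind of standardisation described above (shared column names, lowercasing, punctuation clean-up), here is a minimal sketch using only the Python standard library. It is not part of Splink or the Python Record Linkage Toolkit; the `COLUMN_ALIASES` mapping and the function names are hypothetical and would need adapting to real datasets.

```python
import re
import string

# Hypothetical mapping from source-specific column names to a shared schema.
COLUMN_ALIASES = {
    "fname": "first_name",
    "forename": "first_name",
    "lname": "surname",
    "last_name": "surname",
}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(value):
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Returns None for missing or empty values so they can be treated as nulls.
    """
    if value is None:
        return None
    cleaned = value.lower().translate(_PUNCT_TABLE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or None


def standardise_record(record):
    """Rename columns to the shared schema and clean every string value."""
    out = {}
    for column, value in record.items():
        key = column.strip().lower()
        key = COLUMN_ALIASES.get(key, key)
        out[key] = clean_text(value) if isinstance(value, str) else value
    return out
```

Applying `standardise_record` to every row of each input dataset before linkage gives all sources the same column names and formatting. Whether punctuation should become a space or be dropped entirely is a design choice that depends on the data.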

Thank you,
Miguel

aalexandersson · 2023-01-26T22:14:54Z

I think it is a great idea. The Chicago-based company DataMade has two useful Python libraries for data standardisation:

probablepeople for parsing names, and usaddress for parsing U.S. addresses.

At a minimum, I would like to see some Splink documentation on how to use existing tools such as probablepeople and usaddress. It would be a great addition to the data prerequisite tutorial.
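A sketch of how usaddress output might be flattened into columns before linkage is shown below. It assumes the third-party `usaddress` package is installed and relies on its `tag` function and `RepeatedLabelError` exception; the output column names and the helper functions are illustrative only and are not a Splink convention. probablepeople could be wrapped in a similar way for names.

```python
# Map a subset of usaddress component labels to flat column names.
# The column names on the right are illustrative assumptions.
LABEL_TO_COLUMN = {
    "AddressNumber": "house_number",
    "StreetName": "street_name",
    "StreetNamePostType": "street_type",
    "PlaceName": "city",
    "StateName": "state",
    "ZipCode": "zip_code",
}


def flatten_components(components):
    """Turn a {label: value} mapping into a row with fixed, lowercased columns."""
    row = {column: None for column in LABEL_TO_COLUMN.values()}
    for label, value in components.items():
        column = LABEL_TO_COLUMN.get(label)
        if column is not None:
            row[column] = value.lower()
    return row


def parse_us_address(address):
    """Parse a one-line U.S. address into columns; all None if parsing fails."""
    import usaddress  # third-party package: pip install usaddress

    try:
        components, _address_type = usaddress.tag(address)
    except usaddress.RepeatedLabelError:
        # The parser found the same label twice; treat the address as unparsed.
        return flatten_components({})
    return flatten_components(components)
```

The parsed columns can then be used as separate comparison columns, so that a difference in the house number is weighed separately from a difference in the street name.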

2 replies

RobinL Jan 27, 2023
Maintainer

Nice - wasn't aware of these two libs. For others reading this, for international addresses, there's also libpostal

checkbook-org Feb 17, 2023

My experience with libpostal for US addresses is not great - it seemed to have trouble with addresses that didn't seem like they should be problematic. Also, if I'm thinking of the same library, on Windows it required building a multi-GB .dll that seems to contain training data.

Another couple libraries I came across are:
Deepparse
https://github.com/GRAAL-Research/deepparse

Postmastr in R, not sure if that's US-centric or not. One benefit of postmastr is that it seems to take a narrowing approach to parsing, so that lists of cities can be used rather than putting the entire job of recognizing city names on the parser, etc.

RobinL · 2023-01-27T10:39:33Z

Maintainer

There are no current plans to do this, mostly due to resource constraints. In an ideal world we'd have a robust set of functions for data cleaning, but they're somewhat challenging to write given that we're attempting to support multiple backends.

I agree that we should provide better docs - e.g. signposting existing libraries

EDIT: Just for clarification after reading through this thread:

As it stands, it's unlikely standardisation functions will ever be built into Splink itself. Whilst they are undoubtedly useful, we want to avoid bloat and separate concerns. So if we ever were to work on this, it would be a separate library.

1 reply

amgfernandes Jan 27, 2023
Author

Thanks Robin. That's a great idea about the documentation. I have successfully used the Python Record Linkage Toolkit preprocessing module with Splink, so this could maybe be mentioned in the docs.

NickCrews · 2023-02-16T18:00:57Z

I think Splink should stay focused and not tackle pre-processing. It can link out to external libraries as suggestions, but I don't think we should try to reinvent the wheel.

1 reply

RobinL Feb 26, 2023
Maintainer

Agreed. Much better to use existing solutions if they exist.

Furthermore, Splink is already a lot of lines of code and I don't want to further increase the maintenance burden. Also, standardisation functions could be written pretty much independently from Splink, so should probably be a separate library rather than being strongly coupled to Splink itself.

aflaxman · 2023-02-16T20:54:29Z

I've been watching this thread with a lot of interest. Would folks be interested in participating in some brainstorming around preprocessing: what sort of docs, external libraries, etc. exist, and what gaps there are that Splink or other open-source efforts might consider filling? We could continue collecting ideas on this discussion thread, but it might also be fun to meet in real time over Zoom or something.

10 replies

RobinL Feb 27, 2023
Maintainer

Thanks for the JSTOR link - I had not come across this before so I will try and get my head around it!

Re: the GUI for clerical matching. It's still in development. It basically works, but frustratingly I just haven't had enough time to finish it off. For what it's worth, a current mostly-working draft is here. That link includes all the code and the GUI, but obviously in the final version it would just present the GUI. But this GUI is totally separate to the Splink code - the only 'interface' between the two is that the input data and output data must be in a specific format.

checkbook-org Feb 28, 2023

I'm interested in your idea of doing a review of preprocessing tools.
A couple of use cases I see as worth some joint development effort are building out more sophisticated comparisons for names and addresses.

1 - Matching names
Names are challenging because in many cases it's hard to definitively say what is a last name vs a middle name. And of course in non-English-speaking cultures there are a variety of ways of handling multiple surnames.
Something that uses the ComparisonLevels component of Splink and figures out a sensible scheme to parse, clean, and compare names, given the uncertainty of whether a part of a name is in the correct field or not, plus good documentation, seems like it would address a very common set of use cases and would be reasonably generalizable.

2 - Matching addresses using geospatial functions
Along the same lines, some form of address matching that is geographical in nature (thinking of using the new DuckDB geospatial extension), along with good instructions on how to preprocess and use it, also seems like it would address a common use case.
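One way the name-matching idea could work is sketched below: compare names first field by field, then as an unordered set of tokens, so that a surname that has ended up in the middle-name field still counts as agreement. This is a plain-Python illustration and does not use Splink's API; the level labels, the field names and the 0.5 overlap threshold are all assumptions. In Splink, each branch would presumably become its own comparison level.

```python
NAME_FIELDS = ("first_name", "middle_name", "surname")


def name_tokens(values):
    """Collect the lowercase tokens from all name fields into one set."""
    tokens = set()
    for value in values:
        tokens.update(value.replace("-", " ").split())
    return tokens


def name_comparison_level(left, right):
    """Return a label describing how well two name records agree.

    `left` and `right` are dicts that may contain the keys in NAME_FIELDS.
    """
    l_vals = [(left.get(f) or "").lower().strip() for f in NAME_FIELDS]
    r_vals = [(right.get(f) or "").lower().strip() for f in NAME_FIELDS]
    l_tokens, r_tokens = name_tokens(l_vals), name_tokens(r_vals)

    if not l_tokens or not r_tokens:
        return "null"
    if l_vals == r_vals:
        return "exact_match_all_fields"
    if l_tokens == r_tokens:
        # Same name parts, but split across the fields differently.
        return "same_tokens_different_fields"
    overlap = len(l_tokens & r_tokens) / len(l_tokens | r_tokens)
    if overlap >= 0.5:  # assumed threshold, to be tuned on real data
        return "partial_token_overlap"
    return "else"
```

For example, a record with middle name "Garcia" and surname "Lopez" compared with one whose surname is "Garcia Lopez" falls into the `same_tokens_different_fields` level rather than being treated as a mismatch.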

I have been working with Pelias, and generally it seems like a good tool (North America-centric, if I'm not mistaken). However, I've had to work around its dependency on libpostal. Currently investigating usaddress as a replacement. My organization also uses Smarty Streets, and I have been impressed with the data quality. They recently raised their prices, though, so it's gotten kind of unaffordable.

RobinL Mar 9, 2023
Maintainer

This is interesting for US addresses:
https://geocoding.geo.census.gov/geocoder/
from the faq:

The overall goal of the Census Geocoder is to provide the public with the capability to geocode an address efficiently and accurately. There are multiple options that a user can select to geocode an address with the Geocoder. For further information on use of the Census Geocoder, please refer to the User Guide or Frequently Asked Questions documents found below in the Documentation section.
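Once addresses have been geocoded to latitude and longitude by a service like this, a geographic comparison becomes possible. The sketch below computes the great-circle (haversine) distance between two points and buckets it into distance bands. The band thresholds and function names are assumptions for illustration; nothing here calls the Census Geocoder or Splink.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # min() guards against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_level(km, thresholds_km=(0.1, 1.0, 10.0)):
    """Bucket a distance into the first band it fits, or 'else'."""
    for threshold in thresholds_km:
        if km <= threshold:
            return f"within_{threshold}_km"
    return "else"
```

Two addresses that geocode to within 100 metres of each other would land in the tightest band even if their text differs, which is the appeal of a geographic comparison over a purely string-based one.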

RobinL Mar 9, 2023
Maintainer

ChatGPT also does quite a good job. This is using the gpt-3.5-turbo model. It would possibly get quite expensive for a large dataset, though. Addresses are from https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/

aalexandersson Mar 9, 2023

This is very interesting! Can ChatGPT improve if being told to use a technical specification? For example:

"Standardize the same addresses, but this time the data are for the United States instead of United Kingdom. Use the technical specification of Project US@ at https://asapnet.org/wp-content/uploads/2022/03/Project_US_FINAL_Technical_Specification_Version_1.0.pdf. Use the field name city instead of the field town. Ignore the fields village and postcode."

For example, in your example data, the road name "Ammanford Road" should be standardized to "AMMANFORD RD".
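For comparison, the "Ammanford Road" to "AMMANFORD RD" step can also be done with a simple rule rather than a language model. The sketch below uppercases the street, strips punctuation and abbreviates the final token using a small lookup table. The table is a short illustrative subset of common USPS-style suffix abbreviations and is an assumption here; it is not copied from the Project US@ specification, which covers far more than this.

```python
import re

# Small illustrative subset of street suffix abbreviations (assumed, not
# taken from the Project US@ specification itself).
SUFFIX_ABBREVIATIONS = {
    "ROAD": "RD",
    "STREET": "ST",
    "AVENUE": "AVE",
    "DRIVE": "DR",
    "LANE": "LN",
    "BOULEVARD": "BLVD",
}


def standardise_street(street):
    """Uppercase, drop punctuation, and abbreviate the trailing suffix only."""
    tokens = re.sub(r"[^\w\s]", " ", street).upper().split()
    if tokens:
        # Only the last token is treated as a suffix, so "Avenue Road"
        # becomes "AVENUE RD" rather than "AVE RD".
        tokens[-1] = SUFFIX_ABBREVIATIONS.get(tokens[-1], tokens[-1])
    return " ".join(tokens)
```

With this table, `standardise_street("Ammanford Road")` returns `"AMMANFORD RD"`.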

aalexandersson · 2023-02-27T15:31:34Z

Re the GUI for clerical matching: I am not sure about the workflow, but it seems very promising! Did you consider prefilling the clerical match score with the estimated match probability? Some rounding of decimals may be needed, because I cannot imagine four digits being useful in a clerical review. I see two possible advantages of prefilling:

1. Speed up the clerical review (at the risk of being biased by the prefilled values). In practice, most record pairs usually do not need clerical review, and there is only enough time to clerically review the uncertain record pairs ("possibles").
2. Enable the automatic calculation of a 2-by-2 "summary truth table". It is also known under several other names, such as confusion matrix or error matrix. In the R package fastLink, the implementation is the function confusion().
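A minimal sketch of how such a 2-by-2 table could be computed from clerically reviewed pairs is below. It assumes each reviewed pair carries the model's estimated match probability and the reviewer's yes/no decision; the function name, the input format and the default threshold are hypothetical and do not reflect the GUI's actual output format or fastLink's `confusion()`.

```python
def confusion_matrix(pairs, threshold=0.5):
    """Count agreement between model predictions and clerical labels.

    `pairs` is an iterable of (match_probability, clerical_is_match) tuples.
    A pair is predicted to be a match when its probability >= threshold.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for match_probability, clerical_is_match in pairs:
        predicted_match = match_probability >= threshold
        if predicted_match and clerical_is_match:
            counts["tp"] += 1
        elif predicted_match and not clerical_is_match:
            counts["fp"] += 1
        elif clerical_is_match:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Example with made-up review results.
reviewed = [(0.97, True), (0.81, False), (0.40, True), (0.05, False)]
print(confusion_matrix(reviewed, threshold=0.5))
```

Recomputing the table at several thresholds shows how the trade-off between false positives and false negatives shifts as the match threshold moves.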

0 replies
