Any plans to include a data standardisation module? #986
Replies: 5 comments 14 replies
-
I think it is a great idea. The Chicago-based company DataMade has two useful Python libraries for data standardisation: probablepeople for parsing names, and usaddress for parsing U.S. addresses. At minimum, I would like to see some
-
There are no current plans to do this, mostly due to resource constraints. In an ideal world we'd have a robust set of functions for data cleaning, but they're somewhat challenging to write given that we're attempting to support multiple backends. I agree that we should provide better docs, e.g. signposting existing libraries.

EDIT: Just for clarification after reading through this thread: as it stands, it's unlikely standardisation functions will ever be built into Splink itself. Whilst they are undoubtedly useful, we want to avoid bloat and keep concerns separate. So if we ever were to work on this, it would be a separate library.
-
I think Splink should stay focused and not tackle pre-processing. It can link out to external libraries as suggestions, but I don't think we should try to reinvent the wheel.
-
I've been watching this thread with a lot of interest. Would folks be interested in participating in some brainstorming around preprocessing: what sort of docs, external libraries, etc. exist, and what gaps there are that Splink or other open-source efforts might consider filling? We could continue collecting ideas on this discussion thread, but it might also be fun to meet in real time over Zoom or something.
-
Re the GUI for clerical matching: I am not sure about the workflow, but it seems very promising! Have you considered prefilling the
-
First of all, thank you very much for the great work and for providing Splink as open source. It's a great tool and I have used it regularly for several projects.
Have you considered including a preprocessing module so that datasets can be standardised (same column names and consistent formatting, such as lowercasing and punctuation cleanup)?
Something similar to this: https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html
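To make the request more concrete, here is a rough sketch of the kind of standardisation I mean, using only the Python standard library. It is not Splink code and not the recordlinkage API; the names `COLUMN_ALIASES`, `clean_value` and `standardise_record` are made up for illustration, and the alias mapping is just an assumed example.

```python
import re
import string

# Hypothetical mapping from source-specific column names to a shared schema.
COLUMN_ALIASES = {
    "fname": "first_name",
    "forename": "first_name",
    "lname": "surname",
    "last_name": "surname",
}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_value(value):
    """Lowercase, strip ASCII punctuation and collapse whitespace.

    Returns None for missing or empty values so blanks are treated as nulls.
    """
    if value is None:
        return None
    cleaned = value.lower().translate(_PUNCT_TABLE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or None


def standardise_record(record):
    """Rename columns to the shared schema and clean string values."""
    out = {}
    for column, value in record.items():
        key = column.strip().lower()
        key = COLUMN_ALIASES.get(key, key)
        out[key] = clean_value(value) if isinstance(value, str) else value
    return out


example = standardise_record(
    {"FName": " Miguel ", "Last_Name": "O'Brien-Smith", "age": 30}
)
```

Note that deleting punctuation outright joins hyphenated names ("O'Brien-Smith" becomes "obriensmith"); replacing hyphens with spaces would be an equally reasonable choice depending on the data.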
Thank you,
Miguel