Add a feature engineering topic guide to docs #1121

Closed
RossKen opened this issue Mar 15, 2023 · 3 comments · Fixed by #1178
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

RossKen commented Mar 15, 2023

Is your proposal related to a problem?

While we do not want to create a data standardisation module, it would be good to have a place to document ideas and suggestions on how users can get the most out of their data in the preprocessing step (beyond standardisation) before it reaches splink, in order to get better comparisons (potentially with some code examples).

For example

  • Phonetics - we have a high-level guide describing what soundex and dmetaphone are, but not how to actually use them (e.g. the Scala UDFs). We also need to reword it to explain the benefits in comparison levels as well as in blocking.
  • Postcodes - we should soon have a string matching function for postcodes, but in most cases, if users created lat-long columns, they would get much more value out of the cll.distance_in_km_level and cl.distance_in_km_at_thresholds functions.
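
To make the two ideas above concrete, here is a minimal pure-Python sketch of the kind of derived features a guide could describe: a phonetic code for a name and a distance in km between two lat-long pairs. The function names `soundex` and `haversine_km` are hypothetical helpers written for illustration, not part of splink, and this is not how splink or its SQL backends compute these values. In practice you would probably rely on the phonetic and geographic functions your backend already provides.

```python
import math

# Standard American Soundex consonant groups (vowels, H, W, Y carry no code).
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if there are no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev_code = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev_code:
            encoded += code
        prev_code = code  # vowels reset the previous code to ""
    return (encoded + "000")[:4]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Names that sound alike share a code, so the code can serve as a blocking
# key or as an extra comparison level alongside exact match.
print(soundex("Robert"), soundex("Rupert"))

# Distance between two made-up coordinates one degree of latitude apart.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

A phonetic column groups spelling variants that exact matching would miss, while a distance in km gives graded evidence (for example "within 1 km", "within 10 km") that a postcode string comparison cannot express.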

Describe the solution you'd like

Topic guide in the docs with suggestions for feature engineering.

Describe alternatives you've considered

Additional context

RossKen added the documentation and enhancement labels Mar 15, 2023
RossKen commented Mar 15, 2023

Example of user query - #1110

aalexandersson commented Mar 16, 2023

Another example of a missing high-level topic guide which directly affects pre-processing other than standardisation:

  • Test data - How many test datasets are currently easily available? How are these test data similar or different, for example in terms of size, missingness, variables, and the type and amount of errors? How can the splink test data be improved? What other record linkage test data are available? Is there any attempt to formally cooperate with other known developers of Python test data such as @aflaxman and @joke2k?

@aflaxman

You can rest assured that I'll have my eye on anything splink devs are working on in this space! Let me know if there is anything I can do to help. :)
