Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splink datasets - simplify + restructure #2378

Merged
merged 10 commits into from
Sep 4, 2024
Merged

Conversation

ADBond
Copy link
Contributor

@ADBond ADBond commented Sep 4, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #2369.

Much of this is long-overdue tidying and simplifying of the over-complex code that was in place. There are a few things repeated in a couple of places now (dataset names + descriptions), but I think this is okay, as these are unlikely to change / be added to particularly frequently.

Give a brief description for the solution you have provided

Functionality is largely the same. Details:

  • splink_datasets should work now even if the user does not have write permissions for the caching folder. If not, they will download directly and use without on-disk caching
  • datasets are cached on the object when accessed, so repeated calls will not cause multiple reads/downloads
  • code is split into modules to make it more maintainable
  • dataset attributes are defined statically, so can be accessed with autocomplete
  • as well as descriptions, dataset docstrings include column information

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

splink/internals/datasets/splink_datasets.py Outdated Show resolved Hide resolved
Co-authored-by: Robin Linacre <robin.linacre@digital.justice.gov.uk>
@ADBond ADBond merged commit 25219a4 into master Sep 4, 2024
@ADBond ADBond deleted the splink-datasets-simplify branch September 4, 2024 12:40
@ADBond ADBond mentioned this pull request Sep 4, 2024
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Splink datasets revamp
2 participants