
Splink dummy data #1358

Merged 17 commits into moj-analytical-services:master on Jul 5, 2023

Conversation

@ADBond (Contributor) commented Jun 23, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

No

Give a brief description for the solution you have provided

While writing some documentation, I realised that it might be useful, particularly for new users, to have some dummy data pre-baked into Splink to get up and running with, without them having to copy files and read them in manually.

From the user's perspective this is simple: import splink_data_sets, and each attribute is a pandas dataframe:

from splink.datasets import splink_data_sets
from splink.duckdb.linker import DuckDBLinker  # assumed Splink 3.9.x import path

df = splink_data_sets.fake_1000  # downloaded on first access, then cached locally
...
linker = DuckDBLinker(df, settings)

Under the hood this downloads the dataset, or retrieves it from a local cache folder if it has been downloaded previously, so as not to bloat the size of the package. The download only happens when a dataset is accessed, so we could potentially include very large / rich datasets without any performance or storage penalty if they are never accessed.
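
For illustration, here is a minimal sketch of the lazy download-and-cache pattern described above. The cache directory, URL, and class names are hypothetical, not Splink's actual internals:

# Illustrative sketch only - not Splink's real implementation.
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

_CACHE_DIR = Path.home() / ".splink_data_cache"  # hypothetical cache location
_DATASET_URLS = {
    "fake_1000": "https://example.com/fake_1000.csv",  # placeholder URL
}


class _SplinkDataSets:
    def __getattr__(self, name):
        """Download the dataset on first access, then serve it from the local cache."""
        if name not in _DATASET_URLS:
            raise AttributeError(f"No such dataset: {name}")
        _CACHE_DIR.mkdir(parents=True, exist_ok=True)
        local_path = _CACHE_DIR / f"{name}.csv"
        if not local_path.exists():  # only hit the network the first time
            urlretrieve(_DATASET_URLS[name], local_path)
        return pd.read_csv(local_path)


splink_data_sets = _SplinkDataSets()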

There is a basic test - there should probably be more detailed tests that properly exercise the downloading/caching behaviour, but that will be fairly fiddly, so it is probably best left as a follow-up.
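
As a rough idea of the kind of follow-up test meant here, a hypothetical pytest sketch, assuming internals along the lines of the sketch above (a module-level _CACHE_DIR and urlretrieve inside splink.datasets - these names are assumptions, not the real module contents):

import pandas as pd

import splink.datasets as datasets


def test_dataset_is_cached(monkeypatch, tmp_path):
    # Hypothetical test sketch: redirect the cache dir and count downloads.
    calls = []

    def fake_download(url, dest):
        calls.append(url)
        pd.DataFrame({"unique_id": [1, 2]}).to_csv(dest, index=False)

    monkeypatch.setattr(datasets, "_CACHE_DIR", tmp_path, raising=False)
    monkeypatch.setattr(datasets, "urlretrieve", fake_download, raising=False)

    df_first = datasets.splink_data_sets.fake_1000   # triggers the (faked) download
    df_second = datasets.splink_data_sets.fake_1000  # should be served from cache

    assert len(calls) == 1
    assert df_first.equals(df_second)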

Things not considered (probably for future PRs):

  • range of data sets to include
  • where the datasets should live
  • whether options aside from pandas would be useful (load directly into your backend?)
  • haven't checked if caching works properly when splink is installed
  • any kind of niceness if downloading is slow / fails

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks / tutorials in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@aflaxman

I've been meaning to reach out to your team about a related topic, because I've just gotten to beta on a Python package for generating simulated data for cases like this: https://pseudopeople.readthedocs.io/

Might this be an opportunity to brainstorm potential pseudopeople integration into the splink docs? Let me know if you're interested.

You can see a notebook using it with splink here: https://colab.research.google.com/drive/1YVnK1G9vk_RPqH4kv52eZj_qCC8UtWMS?usp=sharing

@RobinL (Member) commented Jun 24, 2023

@aflaxman that looks fantastic. I'd definitely be up for adding an example to the Splink docs that uses a dataset generated by pseudopeople. Do you have a suggestion of a pre-created dataset that we could use? Or would you suggest that the example both generates and then links a dataset? If it turns out well, we could also consider adding it to these proposed inbuilt datasets.

Please could we continue the discussion here, to keep it separate from this issue?

@RobinL (Member) commented Jun 24, 2023

@ADBond love this! I'd probably start with the fake_1000 dataset only, simply because I think we want to think a little carefully about which example datasets are best to include (i.e. which ones we actually want people to play with), but I wouldn't want that to sidetrack getting this feature added.

@aflaxman

@aflaxman that looks fantastic. I'd definitely be up for adding an example to the Splink docs that uses a dataset generated by pseudopeople. Do you have a suggestion of a pre-created dataset that we could use? Or would you suggest that the example both generates and then links a dataset? If it turns out well we could also consider adding to these proposed inbuilt datasets

Please could we continue the discussion here to separate it from this issue

Great, I've responded with some initial thoughts on #1361 . Thanks to all of you for your work on this amazing project!

@ADBond marked this pull request as ready for review June 28, 2023 17:55

@RossKen (Contributor) left a comment

@ADBond this is fantastic!

All works as expected bar one small thing. When downloading the data I got an SSLCertVerificationError, which I was able to get rid of using the ssl package with ssl._create_default_https_context = ssl._create_unverified_context. That made it run fine, but it may not be best practice.
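
For reference, the workaround applied in full looks like the snippet below; it disables certificate verification for every urllib-based request in the process, so it should only ever be a temporary measure:

import ssl

# Bypass certificate verification for urllib downloads.
# This weakens security - only use as a temporary workaround.
ssl._create_default_https_context = ssl._create_unverified_context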

@RossKen (Contributor) commented Jun 30, 2023

One additional thought: if these datasets are going to become more of a mainstay for Splink, do you think it is worth migrating all the datasets we are going to build in to a separate repo? Then they could be called from the demos or by users. It may not be worthwhile, but it just popped into my head that splink_demos and the data could be made separate as a result of this work.

@ADBond (Contributor, Author) commented Jul 3, 2023

All works as expected bar one small thing. When downloading the data I got a SSLCertVerificationError which I was able to get rid of using the ssl package with ssl._create_default_https_context = ssl._create_unverified_context which made it run fine, but may not be best practice.

Hmm, that's strange - I wonder why the certificate check was failing there. I have not been able to recreate this error.

What do you think the best thing to do here is? I'm a bit reluctant to put something like this directly in the package, as it feels a little iffy to be getting people to bypass certificate errors, but maybe if this error happens it could direct the user to a note with this workaround on the docs page (with some words of caution)?
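
As a rough illustration of that idea (the helper name, message wording, and the existence of a single download function are all assumptions), the download step could catch the certificate failure and point users to a documented workaround:

import ssl
import urllib.error
from urllib.request import urlretrieve


def _download_with_hint(url, dest):
    # Hypothetical helper: surface a friendlier message on certificate failures.
    try:
        urlretrieve(url, dest)
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLCertVerificationError):
            raise RuntimeError(
                "Dataset download failed due to an SSL certificate error. "
                "See the Splink docs for a workaround (use with caution)."
            ) from e
        raise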

@ADBond (Contributor, Author) commented Jul 3, 2023

One additional thought, if these datasets are going to become more of a mainstay for Splink, do you think it is worth migrating all the datasets we are going to build in to a separate repo? Then they can be called from the demos or by users? It may not be worthwhile but just popped into my head that splink_demos and the data could be made separate as a result of this work.

Yes, I think it is ultimately worth separating out where the data lives into its own location. It might be worth a conversation about this when deciding on any further datasets to include, and whether any alternative places to store them make sense.

@RossKen (Contributor) commented Jul 4, 2023

@ADBond yes, agreed that moving the data is a problem for another day, once we figure out what our overall strategy for managing these datasets is.

On the certificate error - I will have another play with this to try to figure out why I was getting the issue. I'm just wary of merging something that could error out like this when it will go into tutorials, example notebooks etc.

@RossKen (Contributor) commented Jul 4, 2023

Hey @ADBond, I have tried this with a few different environments this afternoon and it is working fine. I don't know what was happening the other day, but I think this is good to go.

I will approve now, but a couple of small docs things before you merge. Would you mind:

@ADBond merged commit 4c7a068 into moj-analytical-services:master Jul 5, 2023
@ADBond deleted the lazyload-data branch July 5, 2023 09:13

@aliceoleary0 (Contributor)

@RossKen I also get this same SSLCertVerificationError (running the splink demos locally on vscode, splink version 3.9.5)

@RossKen (Contributor) commented Aug 17, 2023

Hmm, that's weird. Can you open up a new issue for it and I can add it to my list?
