Splink dummy data #1358
Conversation
I've been meaning to reach out to your team about a related topic, because I've just gotten to beta on a python package for generating simulated data for cases like this: https://pseudopeople.readthedocs.io/ Might this be an opportunity to brainstorm potential pseudopeople integration into the splink docs? Let me know! You can see a notebook using it with splink here: https://colab.research.google.com/drive/1YVnK1G9vk_RPqH4kv52eZj_qCC8UtWMS?usp=sharing
@aflaxman that looks fantastic. I'd definitely be up for adding an example to the Splink docs that uses a dataset generated by pseudopeople. Do you have a suggestion of a pre-created dataset that we could use? Or would you suggest that the example both generates and then links a dataset? If it turns out well, we could also consider adding to these proposed inbuilt datasets. Please could we continue the discussion here, to separate it from this issue.
@ADBond love this! I'd probably start with the fake_1000 dataset only, simply because I think we want to think a little carefully about what are the best example datasets to include (i.e. which ones we actually want people to play with), but I wouldn't want that to sidetrack getting this feature added.
Great, I've responded with some initial thoughts on #1361 . Thanks to all of you for your work on this amazing project!
@ADBond this is fantastic!
All works as expected bar one small thing: when downloading the data I got an `SSLCertVerificationError`, which I was able to get rid of using the `ssl` package:

```python
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
```

This made it run fine, but may not be best practice.
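A note on the workaround above: replacing the default HTTPS context disables certificate verification for every request in the process. A less invasive variant (an editorial sketch, not code from this PR; the URL is a placeholder) scopes the unverified context to the single download:

```python
import ssl
import urllib.request

# Build a context that skips certificate verification. This should only ever
# be a last-resort, clearly documented workaround.
ctx = ssl._create_unverified_context()

# Passing the context explicitly affects only this one request; all other
# requests in the process keep normal certificate checking.
# (Placeholder URL -- uncomment to actually download something.)
# with urllib.request.urlopen("https://example.com/data.csv", context=ctx) as resp:
#     data = resp.read()
```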
One additional thought: if these datasets are going to become more of a mainstay for Splink, do you think it is worth migrating all the datasets we are going to build into a separate repo? Then they could be called from the demos or by users. It may not be worthwhile, but it just popped into my head that splink_demos and the data could be made separate as a result of this work.
Hmm, that's strange - I wonder why the certificate was failing there. I have not been able to recreate this error. What do you think is the best thing to do here? I'm a bit reluctant to put something like this directly in the package, as it feels a little iffy to be getting people to bypass certificate errors, but maybe if this error happens it could direct the user to a note with this workaround on the docs page (with some words of caution)?
Yes, I think it is ultimately worth separating out where the data lives into its own location. Might be worth a conversation about this when deciding on any further datasets to include, and whether any alternative places to store them make sense.
@ADBond yes, agreed that moving the data is a problem for another day, once we figure out our overall strategy for managing these datasets. On the certificate error stuff - I will have another play with this to try and figure out why I was getting the issue. I'm just wary of merging something that could error out like this when it will go into tutorials, example notebooks etc.
Hey @ADBond, I have tried this with a few different environments this afternoon and it is working fine. I don't know what was happening the other day, but I think this is good to go. I will approve now, but there are a couple of small docs things before you merge. Would you mind:
@RossKen I also get this same SSLCertVerificationError (running the splink demos locally on vscode, splink version 3.9.5)
Hmm, that's weird. Can you open up a new issue for it and I can add it to my list?
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
No
Give a brief description for the solution you have provided
In writing some documentation, I realised that it might be useful, particularly for new users, to have some dummy data pre-baked into Splink to get up-and-running with, without having to copy files and read them in manually.
From the user perspective this works simply: import `splink_data_sets`, and each attribute is a pandas DataFrame.

Under the hood this downloads the dataset, or retrieves it from a local cache folder if it has been downloaded previously, so as not to bloat the size of the package. This only happens as datasets are accessed, so we could potentially include very large / rich datasets without any performance / storage penalty if they are never accessed.
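The download-and-cache-on-first-access behaviour described above can be sketched roughly as follows. All names here (`LazyDatasetStore`, the placeholder URL, the injectable `fetch` callable) are illustrative assumptions, not the actual Splink implementation:

```python
import os
import tempfile


class LazyDatasetStore:
    """Maps attribute names to URLs; downloads on first access, then caches.

    Illustrative sketch only: returns the raw file contents as a string,
    where a real implementation would parse into a pandas DataFrame.
    """

    def __init__(self, urls, cache_dir=None, fetch=None):
        self._urls = urls
        self._cache_dir = cache_dir or os.path.join(
            tempfile.gettempdir(), "dataset_cache"
        )
        # Injectable for testing; real code would download via urllib/requests.
        self._fetch = fetch
        os.makedirs(self._cache_dir, exist_ok=True)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for dataset names.
        if name not in self._urls:
            raise AttributeError(name)
        path = os.path.join(self._cache_dir, f"{name}.csv")
        if not os.path.exists(path):
            # Cache miss: download once and store on disk.
            data = self._fetch(self._urls[name])
            with open(path, "w") as f:
                f.write(data)
        # Cache hit (or freshly written file): read from disk, no network.
        with open(path) as f:
            return f.read()
```

Datasets that are never accessed cost nothing, which is what keeps the package itself small regardless of how many datasets are registered.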
There is a basic test - there should probably be more detailed tests which properly deal with stuff around downloading/caching, but that will be fairly fiddly so probably best as a follow-up.
Things not considered (probably for future PRs):
haven't checked if caching works properly when splink is installed

PR Checklist