
Splink dummy data #1358

Merged 17 commits into moj-analytical-services:master on Jul 5, 2023

Conversation

@ADBond (Contributor) commented Jun 23, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

No

Give a brief description for the solution you have provided

While writing some documentation, I realised that it might be useful, particularly for new users, to have some dummy data pre-baked into Splink to get up and running with, without them having to copy files and read them in manually.

From the user's perspective this is simple: import splink_data_sets, and each attribute is a pandas dataframe:

from splink.datasets import splink_data_sets
from splink.duckdb.linker import DuckDBLinker  # assumed Splink 3.9.x import path

df = splink_data_sets.fake_1000  # downloaded on first access, then cached locally
...
linker = DuckDBLinker(df, settings)

Under the hood this downloads the dataset, or retrieves it from a local cache folder if it has been downloaded previously, so as not to bloat the size of the package. The download only happens when a dataset is accessed, so we could potentially include very large / rich datasets without any performance or storage penalty if they are never accessed.
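
For illustration, here is a minimal sketch of the lazy download-and-cache pattern described above. The cache directory, URL, and class names are hypothetical, not Splink's actual internals:

# Illustrative sketch only - not Splink's real implementation.
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

_CACHE_DIR = Path.home() / ".splink_data_cache"  # hypothetical cache location
_DATASET_URLS = {
    "fake_1000": "https://example.com/fake_1000.csv",  # placeholder URL
}


class _SplinkDataSets:
    def __getattr__(self, name):
        """Download the dataset on first access, then serve it from the local cache."""
        if name not in _DATASET_URLS:
            raise AttributeError(f"No such dataset: {name}")
        _CACHE_DIR.mkdir(parents=True, exist_ok=True)
        local_path = _CACHE_DIR / f"{name}.csv"
        if not local_path.exists():  # only hit the network the first time
            urlretrieve(_DATASET_URLS[name], local_path)
        return pd.read_csv(local_path)


splink_data_sets = _SplinkDataSets()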

There is a basic test - there should probably be more detailed tests that properly exercise the downloading/caching behaviour, but that will be fairly fiddly, so it is probably best left as a follow-up.
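
As a rough idea of the kind of follow-up test meant here, a hypothetical pytest sketch, assuming internals along the lines of the sketch above (a module-level _CACHE_DIR and urlretrieve inside splink.datasets - these names are assumptions, not the real module contents):

import pandas as pd

import splink.datasets as datasets


def test_dataset_is_cached(monkeypatch, tmp_path):
    # Hypothetical test sketch: redirect the cache dir and count downloads.
    calls = []

    def fake_download(url, dest):
        calls.append(url)
        pd.DataFrame({"unique_id": [1, 2]}).to_csv(dest, index=False)

    monkeypatch.setattr(datasets, "_CACHE_DIR", tmp_path, raising=False)
    monkeypatch.setattr(datasets, "urlretrieve", fake_download, raising=False)

    df_first = datasets.splink_data_sets.fake_1000   # triggers the (faked) download
    df_second = datasets.splink_data_sets.fake_1000  # should be served from cache

    assert len(calls) == 1
    assert df_first.equals(df_second)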

Things not considered (probably for future PRs):

  • range of data sets to include
  • where the datasets should live
  • whether options aside from pandas would be useful (load directly into your backend?)
  • haven't checked if caching works properly when splink is installed
  • any kind of niceness if downloading is slow / fails

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks / tutorials in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@aflaxman

I've been meaning to reach out to your team about a related topic, because I've just gotten to beta on a Python package for generating simulated data for cases like this: https://pseudopeople.readthedocs.io/

Might this be an opportunity to brainstorm potential pseudopeople integration into the splink docs? Let me know if you're interested.

You can see a notebook using it with splink here: https://colab.research.google.com/drive/1YVnK1G9vk_RPqH4kv52eZj_qCC8UtWMS?usp=sharing

@RobinL (Member) commented Jun 24, 2023

@aflaxman that looks fantastic. I'd definitely be up for adding an example to the Splink docs that uses a dataset generated by pseudopeople. Do you have a suggestion of a pre-created dataset that we could use? Or would you suggest that the example both generates and then links a dataset? If it turns out well, we could also consider adding it to these proposed inbuilt datasets.

Please could we continue the discussion here, to keep it separate from this issue?

@RobinL (Member) commented Jun 24, 2023

@ADBond love this! I'd probably start with the fake_1000 dataset only, simply because I think we want to think a little carefully about which example datasets are best to include (i.e. which ones we actually want people to play with), but I wouldn't want that to sidetrack getting this feature added.

@aflaxman

@aflaxman that looks fantastic. I'd definitely be up for adding an example to the Splink docs that uses a dataset generated by pseudopeople. Do you have a suggestion of a pre-created dataset that we could use? Or would you suggest that the example both generates and then links a dataset? If it turns out well we could also consider adding to these proposed inbuilt datasets

Please could we continue the discussion here to separate it from this issue

Great, I've responded with some initial thoughts on #1361 . Thanks to all of you for your work on this amazing project!

@ADBond marked this pull request as ready for review June 28, 2023 17:55

@RossKen (Contributor) left a comment

@ADBond this is fantastic!

All works as expected bar one small thing. When downloading the data I got an SSLCertVerificationError, which I was able to get rid of using the ssl package with ssl._create_default_https_context = ssl._create_unverified_context. That made it run fine, but it may not be best practice.
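
For reference, the workaround applied in full looks like the snippet below; it disables certificate verification for every urllib-based request in the process, so it should only ever be a temporary measure:

import ssl

# Bypass certificate verification for urllib downloads.
# This weakens security - only use as a temporary workaround.
ssl._create_default_https_context = ssl._create_unverified_context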

@RossKen (Contributor) commented Jun 30, 2023

One additional thought: if these datasets are going to become more of a mainstay for Splink, do you think it is worth migrating all the datasets we are going to build in to a separate repo? Then they could be called from the demos or by users. It may not be worthwhile, but it just popped into my head that splink_demos and the data could be made separate as a result of this work.

@ADBond (Contributor, Author) commented Jul 3, 2023

All works as expected bar one small thing. When downloading the data I got a SSLCertVerificationError which I was able to get rid of using the ssl package with ssl._create_default_https_context = ssl._create_unverified_context which made it run fine, but may not be best practice.

Hmm, that's strange - I wonder why the certificate check was failing there. I have not been able to recreate this error.

What do you think the best thing to do here is? I'm a bit reluctant to put something like this directly in the package, as it feels a little iffy to be getting people to bypass certificate errors, but maybe if this error happens it could direct the user to a note with this workaround on the docs page (with some words of caution)?
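
As a rough illustration of that idea (the helper name, message wording, and the existence of a single download function are all assumptions), the download step could catch the certificate failure and point users to a documented workaround:

import ssl
import urllib.error
from urllib.request import urlretrieve


def _download_with_hint(url, dest):
    # Hypothetical helper: surface a friendlier message on certificate failures.
    try:
        urlretrieve(url, dest)
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLCertVerificationError):
            raise RuntimeError(
                "Dataset download failed due to an SSL certificate error. "
                "See the Splink docs for a workaround (use with caution)."
            ) from e
        raise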

@ADBond (Contributor, Author) commented Jul 3, 2023

One additional thought, if these datasets are going to become more of a mainstay for Splink, do you think it is worth migrating all the datasets we are going to build in to a separate repo? Then they can be called from the demos or by users? It may not be worthwhile but just popped into my head that splink_demos and the data could be made separate as a result of this work.

Yes, I think it is ultimately worth separating out where the data lives into its own location. It might be worth a conversation about this when deciding on any further datasets to include, and whether any alternative places to store them make sense.

@RossKen (Contributor) commented Jul 4, 2023

@ADBond yes, agreed that moving the data is a problem for another day, once we figure out what our overall strategy for managing these datasets is.

On the certificate error - I will have another play with this to try to figure out why I was getting the issue. I'm just wary of merging something that could error out like this when it will go into tutorials, example notebooks etc.

@RossKen (Contributor) commented Jul 4, 2023

Hey @ADBond, I have tried this with a few different environments this afternoon and it is working fine. I don't know what was happening the other day, but I think this is good to go.

I will approve now, but a couple of small docs things before you merge. Would you mind:

@ADBond merged commit 4c7a068 into moj-analytical-services:master Jul 5, 2023
@ADBond deleted the lazyload-data branch July 5, 2023 09:13

@aliceoleary0 (Contributor)

@RossKen I also get this same SSLCertVerificationError (running the splink demos locally on vscode, splink version 3.9.5)

@RossKen (Contributor) commented Aug 17, 2023

Hmm, that's weird. Can you open up a new issue for it and I can add it to my list?
