Datasets for benchmarking strategies #8

paulmorio · 2022-04-27T15:48:16Z

Modules to download and process datasets from online sources into torch.utils.data.Dataset instances, with additional attributes for (stratified) k-fold CV as described in the paper.
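To make the stratified k-fold idea concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `stratified_kfold_indices` and its parameters are hypothetical, and it only produces index lists. Such indices could then be wrapped with something like torch.utils.data.Subset to obtain per-fold datasets.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_kfold_indices(
    labels: Sequence[int], n_splits: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: return (train_indices, test_indices) pairs in which
    every test fold keeps roughly the class proportions of the full dataset."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class and deal its members round-robin across the folds.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)

    # Fold k is the test set; all remaining folds form the train set.
    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


# Toy example: 10 samples of class 0 and 5 samples of class 1.
labels = [0] * 10 + [1] * 5
splits = stratified_kfold_indices(labels, n_splits=5)
for train, test in splits:
    print(len(train), len(test), [labels[i] for i in test])
```

In this toy case each test fold holds two samples of class 0 and one of class 1. When class counts do not divide evenly by the number of folds, this simple round-robin places the extra samples in the earlier folds.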

This introduces a few new dependencies, namely openpyxl, xlrd, and pyreadr, for processing the Excel and R data storage formats of the original raw data.

Also included are utility functions for transforming each of the datasets into datamanagers that have "cold" or "warm" label initialisations for benchmarking active learning (AL) strategies on the datasets.
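The description does not spell out what "cold" and "warm" mean, so the sketch below rests on assumptions: "cold" is taken to mean one randomly chosen labelled sample per class, and "warm" a random 10% of the training indices labelled. The function names `cold_start_split` and `warm_start_split` are hypothetical and are not the utilities added in this PR. The sketch only shows how the labelled and unlabelled index pools might be formed before being handed to a datamanager.

```python
import random
from typing import List, Sequence, Tuple


def cold_start_split(
    labels: Sequence[int], train_indices: Sequence[int], seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assumed 'cold' start: label exactly one random sample per class."""
    rng = random.Random(seed)
    labelled = []
    for cls in sorted({labels[i] for i in train_indices}):
        candidates = [i for i in train_indices if labels[i] == cls]
        labelled.append(rng.choice(candidates))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return sorted(labelled), unlabelled


def warm_start_split(
    train_indices: Sequence[int], fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assumed 'warm' start: label a random fraction of the train indices."""
    rng = random.Random(seed)
    n_labelled = max(1, int(round(fraction * len(train_indices))))
    labelled = sorted(rng.sample(list(train_indices), n_labelled))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


# Toy example: 20 training samples with alternating binary labels.
labels = [0, 1] * 10
train_indices = list(range(20))
cold_labelled, cold_unlabelled = cold_start_split(labels, train_indices)
warm_labelled, warm_unlabelled = warm_start_split(train_indices, fraction=0.1)
print(len(cold_labelled), len(cold_unlabelled))
print(len(warm_labelled), len(warm_unlabelled))
```

Fixing the seed keeps the initial labelled pool identical across strategies, which is what makes a benchmark comparison between AL strategies fair.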

Includes tests for all the modules implemented.

paulmorio · 2022-05-02T15:44:53Z

Updated with some tests for uci_datasets. I looked into updating the coverage dynamically, and the best solution I've come across so far is the one described here for GitHub Actions: https://github.com/marketplace/actions/dynamic-badges

Unfortunately, I don't have the rights to access the secrets settings for the repo, so I can't complete the setup described there.

I've added a reference documentation page. This could be followed by another tutorial at a later date.

thomasgaudelet

Nice!

paulmorio added 30 commits April 20, 2022 18:31

Ignore experiment logs generated in the examples

4ef7ecb

Initial commit with UCIDatasets module for downloading and initial preprocessing of datasets on the UCI database

c390aff

SynthClass datasets as in Yang and Loog paper

a8dff40

Adding pyreadr and xlrd as dependencies for processing datasets from UCI (and some others that are stored as R mat files or excel xls files)

6c07574

DocStrings on SynthClass datasets

1518c8c

Adding Breast Cancer and Digits datasets

20a43fd

Fashion MNIST added

d2ad56a

Fashion MNIST added

2774b80

Added UCI Classification datasets

754e231

Striatum dataset

7f981c9

GaussianClouds dataset from Konyushkova et al

bdf8207

Checkerboard datasets from Konyushkova et al

55ff93f

CreditCard Dataset from Dal Pozzolo et al.

4103f58

Bringing over regression datasets

6f9d703

SynthReg2 and Diabetes dataset from Efron et al.

bccf4f0

UCI Regression datasets

4374971

Packaging and work on datamanager generation for benchmarks

b8bd36b

Cold start data manager functions for classification and regression datasets

e6c5f51

Initial work on tests for new modules

5238390

Ignore experiment_logs generated running pytest from project root

32f7309

Tests for regression datasets

9912e1e

Making UCIDatasets download to tmp on default

d9bb499

Adding openpyxl as dependency to read excel sheets used as raw data storage on some UCI datasets

7d53aa0

Tests for UCI regression datasets

9bd10e0

SynthClass tests

3d6683d

Moved download default to /tmp/ and adjusted k fold to occur over full dataset instead of just the train portion using concatdataset

5e160b1

Adding BreastCancer, Digit and FashionMNIST tests

8336c20

UCI dataset tests

cd94f0b

Tests for datasets in Konyushkova et al

c3c8940

Tests for datamanager al benchmark generators

b2cb389

paulmorio added 3 commits May 2, 2022 16:19

Docs for classification dataset modules

24484e0

Docs for regression dataset modules

4acec58

Reformatting with accompanying text on the page

9fabf6e

thomasgaudelet previously approved these changes May 9, 2022

thomasgaudelet requested review from a-pouplin and cristianregep May 10, 2022 09:54

a-pouplin previously approved these changes May 11, 2022

Deprecating dataset

f6c0513

paulmorio dismissed stale reviews from a-pouplin and thomasgaudelet via f6c0513 May 22, 2022 15:26

paulmorio added 7 commits May 22, 2022 16:32

Removing test for deprecated datasets

5fbe6d4

Deprecating dataset

4cfe919

Updating test for deprecated datasets

acd60a4

Initial work on documentation for the datasets

d8f778c

Updating tutorial on classification datasets

66056b3

Updating tutorial on regression datasets

47d1648

Updating tutorial on regression datasets

edde292

thomasgaudelet previously approved these changes May 23, 2022

jyperion reviewed May 24, 2022

.pre-commit-config.yaml (resolved)

jyperion reviewed May 24, 2022

pyrelational/datasets/uci_datasets.py (outdated, resolved)

jyperion reviewed May 24, 2022

pyrelational/datasets/uci_datasets.py (resolved)

a-pouplin previously approved these changes May 24, 2022

Updating uci_datasets to address redundant shuffle

110cc48

paulmorio dismissed stale reviews from a-pouplin and thomasgaudelet via 110cc48 May 25, 2022 12:09

Updating README

a615c28

thomasgaudelet approved these changes May 25, 2022

paulmorio merged commit 86899d1 into main May 26, 2022

thomasgaudelet deleted the datasets branch May 30, 2022 06:40
