Datasets for benchmarking strategies #8
Merged
59 commits
4ef7ecb  Ignore experiment logs generated in the examples (paulmorio)
c390aff  Initial commit with UCIDatasets module for downloading and initial pr… (paulmorio)
a8dff40  SynthClass datasets as in Yang and Loog paper (paulmorio)
6c07574  Adding pyreadr and xlrd as dependencies for processing datasets from … (paulmorio)
1518c8c  DocStrings on SynthClass datasets (paulmorio)
20a43fd  Adding Breast Cancer and Digits datasets (paulmorio)
d2ad56a  Fashion MNIST added (paulmorio)
2774b80  Fashion MNIST added (paulmorio)
754e231  Added UCI Classification datasets (paulmorio)
7f981c9  Striatum dataset (paulmorio)
bdf8207  GaussianClouds dataset from Konyushkova et al (paulmorio)
55ff93f  Checkerboard datasets from Konyushkova et al (paulmorio)
4103f58  CreditCard Dataset from Dal Pozzolo et al. (paulmorio)
6f9d703  Bringing over regression datasets (paulmorio)
bccf4f0  SynthReg2 and Diabetes dataset from Efron et al. (paulmorio)
4374971  UCI Regression datasets (paulmorio)
b8bd36b  Packaging and work on datamanager generation for benchmarks (paulmorio)
e6c5f51  Cold start data manager functions for classification and regression d… (paulmorio)
5238390  Initial work on tests for new modules (paulmorio)
32f7309  Ignore experiment_logs generated running pytest from project root (paulmorio)
9912e1e  Tests for regression datasets (paulmorio)
d9bb499  Making UCIDatasets download to tmp on default (paulmorio)
7d53aa0  Adding openpyxl as dependency to read excel sheets used as raw data s… (paulmorio)
9bd10e0  Tests for UCI regression datasets (paulmorio)
3d6683d  SynthClass tests (paulmorio)
5e160b1  Moved download default to /tmp/ and adjusted k fold to occur over ful… (paulmorio)
8336c20  Adding BreastCancer, Digit and FashionMNIST tests (paulmorio)
cd94f0b  UCI dataset tests (paulmorio)
c3c8940  Tests for datasets in Konyushkova et al (paulmorio)
b2cb389  Tests for datamanager al benchmark generators (paulmorio)
270ab5b  Linting on the tests (paulmorio)
b7bdac3  Linting on benchmark datamanager (paulmorio)
01837e1  Linting on the classification datasets module (paulmorio)
12a9233  Linting for dataset downloading modules (paulmorio)
5590144  CreditCardDataset added, tested, and linted (paulmorio)
3242b3a  Further linting for code quality standards (paulmorio)
1297a0d  Linted with updated black (paulmorio)
db6ca1e  Removing unused variables in dataset (paulmorio)
ea60572  Updating black revision in pre-commit due to issues breaking CI descr… (paulmorio)
65a89c6  Adding pyreadr, xlrd, openpyxl as requirements and updating required … (paulmorio)
a8dd74a  Commenting tests which require download of data onto specific /tmp/ d… (paulmorio)
fffdc29  Tests will download and store data within project instead of /tmp/ fo… (paulmorio)
5649d2f  add error print to test hook
a60b5e4  Removed redundant tensor operation for FashionMNIST and made creation… (paulmorio)
6fb8bf0  Merge branch 'datasets' of https://github.com/RelationRx/pyrelational… (paulmorio)
29cf397  Unit test for UCI dataset downloader (paulmorio)
24484e0  Docs for classification dataset modules (paulmorio)
4acec58  Docs for regression dataset modules (paulmorio)
9fabf6e  Reformatting with accompanying text on the page (paulmorio)
f6c0513  Deprecating dataset (paulmorio)
5fbe6d4  Removing test for deprecated datasets (paulmorio)
4cfe919  Deprecating dataset (paulmorio)
acd60a4  Updating test for deprecated datasets (paulmorio)
d8f778c  Initial work on documentation for the datasets (paulmorio)
66056b3  Updating tutorial on classification datasets (paulmorio)
47d1648  Updating tutorial on regression datasets (paulmorio)
edde292  Updating tutorial on regression datasets (paulmorio)
110cc48  Updating uci_datasets to address redundant shuffle (paulmorio)
a615c28  Updating README (paulmorio)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| .. _benchmark_datasets: | ||
|
|
||
| Benchmark datasets and AL task configurations | ||
| ============================================= | ||
| A fundamental assumption in evaluating active learning (AL) strategies is that there exists a labelled subset of the training dataset that allows a model to perform as well (on the holdout test set) as a model trained on the entire training set. When evaluating an AL strategy, we are interested in how efficiently it finds such a subset and improves model performance. | ||
|
|
||
| To help users benchmark their strategies and active learning pipelines, we have collected a range of datasets that have been used for benchmarking strategies in the AL literature [#f1]_ . We provide classification and regression datasets from a range of real-world applications. We also provide utilities for creating **cold** and **warm** start label initialisations, which correspond to different active learning tasks, so that you can evaluate your strategy in these scenarios. More on these in the respective sections below. | ||
|
|
||
| This short tutorial covers the `datasets` subpackage, whose classes download and process raw data into PyTorch Datasets that are ready for use with our DataManager classes. These classes extend standard PyTorch Dataset objects and can be used for normal ML experimentation as well. Each dataset takes additional parameters that describe how it is split for cross-validation experiments; the splits are seeded for easier reproduction. | ||
|
|
||
| We hope that this resource makes horizontal analysis of AL strategies across a range of datasets and | ||
| AL tasks easier. Better yet, we hope it will garner interest in establishing a set of challenging active learning benchmarks and tasks that can set a standard for the AL field. | ||
|
|
||
| Example usage: classification dataset | ||
| ------------------------------------- | ||
|
|
||
| In this example we will look at the Wisconsin Breast Cancer (diagnostic) dataset [#f2]_ . It can be downloaded and processed with | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from pyrelational.datasets import BreastCancerDataset | ||
| dataset = BreastCancerDataset(n_splits=5) | ||
|
|
||
| The `n_splits` argument specifies the number of train-test splits that should be computed. For classification datasets the splits are stratified by class. The `dataset` variable behaves like a regular PyTorch Dataset and is compatible with PyTorch's DataLoaders. | ||
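
To make the idea of seeded, class-stratified splits concrete, here is a minimal standard-library sketch. It is an illustration only and not the code used inside the `datasets` subpackage; the function name `stratified_kfold_indices` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, stratified by class.

    Hypothetical helper for illustration; not part of PyRelationAL.
    """
    rng = random.Random(seed)  # seeding makes the splits reproducible

    # Group observation indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into folds
    # so that every fold receives a similar class mix.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the test set; the remaining folds form the train set.
    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


labels = [0] * 30 + [1] * 20  # toy labels: 30 of class 0, 20 of class 1
splits = stratified_kfold_indices(labels, n_splits=5, seed=0)
train_indices, test_indices = splits[0]
```

With these toy labels, every test fold keeps the 3:2 class ratio of the full dataset, and calling the function again with the same seed returns the same splits.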
|
|
||
| The `create_warm_start()` and `create_classification_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` generate PyRelationAL DataManager objects corresponding to the following AL tasks inspired by Konyushkova et al. [#f3]_ . | ||
|
|
||
| - **Cold-start classification**: one observation for each class represented in the training set is labelled, the rest is unlabelled. | ||
| - **Warm-start classification**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled. | ||
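
The two initialisations above can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the implementation behind `create_classification_cold_start()` or `create_warm_start()`: the helper names are hypothetical, and picking the single labelled observation per class at random is an assumption, since the text does not say how it is chosen.

```python
import random


def cold_start_classification(train_indices, labels, seed=0):
    """Label one training observation per class; the rest stay unlabelled.

    Assumption: the labelled observation of each class is drawn at random.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx in train_indices:
        by_class.setdefault(labels[idx], []).append(idx)
    labelled = sorted(rng.choice(members) for _, members in sorted(by_class.items()))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


def warm_start(train_indices, fraction=0.1, seed=0):
    """Label a randomly sampled fraction of the training set (10 percent by default)."""
    rng = random.Random(seed)
    n_labelled = max(1, round(fraction * len(train_indices)))
    labelled = sorted(rng.sample(list(train_indices), n_labelled))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


labels = [0, 1, 2] * 10          # toy labels: 30 observations, 3 classes
train_indices = list(range(20))  # pretend the first 20 observations are the train split
cold_labelled, cold_unlabelled = cold_start_classification(train_indices, labels, seed=0)
warm_labelled, warm_unlabelled = warm_start(train_indices, fraction=0.1, seed=0)
```

In the cold-start case the labelled pool has exactly as many observations as there are classes in the training split, which is why it is the harder of the two starting points for a strategy.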
|
|
||
| The following code snippet will return a DataManager corresponding to a cold-start initialisation for the breast cancer classification dataset using one of the precomputed splits: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
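| from pyrelational.datasets import BreastCancerDataset | ||
| from pyrelational.datasets.benchmark_datamanager import create_classification_cold_start | ||
| dataset = BreastCancerDataset() | ||
| # data_splits[0] holds the (train, test) indices of the first precomputed split | ||
| train_indices = list(dataset.data_splits[0][0]) | ||
| test_indices = list(dataset.data_splits[0][1]) | ||
| dm = create_classification_cold_start(dataset, train_indices=train_indices, test_indices=test_indices) | ||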
|
|
||
|
|
||
| Example usage: regression dataset | ||
| --------------------------------- | ||
|
|
||
| This example mirrors the classification case, adjusted for the regression task. We will use the UCI Diabetes dataset [#f4]_ . It can be downloaded and processed with | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from pyrelational.datasets import DiabetesDataset | ||
| dataset = DiabetesDataset(n_splits=5) | ||
|
|
||
| As before, the `n_splits` argument specifies the number of train-test splits that should be computed for the cross-validation setup. For regression these are random splits, not stratified as in the classification case. | ||
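
For comparison with the stratified case, a seeded random k-fold split can be sketched as below. This is only an illustration of the idea; `random_kfold_indices` is a hypothetical helper and not the function used by the regression dataset classes.

```python
import random


def random_kfold_indices(n_samples, n_splits=5, seed=0):
    """Return (train_indices, test_indices) pairs from a seeded random k-fold split.

    Hypothetical helper for illustration; not part of PyRelationAL.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Slice the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[k::n_splits] for k in range(n_splits)]

    splits = []
    for k, test_fold in enumerate(folds):
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, sorted(test_fold)))
    return splits


splits = random_kfold_indices(n_samples=23, n_splits=5, seed=0)
train_indices, test_indices = splits[0]
```

No label information is used here, so the target distribution may differ somewhat between folds.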
|
|
||
| The `create_warm_start()` and `create_regression_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` generate PyRelationAL DataManager objects corresponding to the following AL tasks inspired by Konyushkova et al. [#f3]_ . | ||
|
|
||
| - **Cold-start regression**: the two observations with the highest pairwise Euclidean distance in the training set are labelled, the rest is unlabelled. | ||
| - **Warm-start regression**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled. | ||
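
The cold-start rule for regression can be sketched with a brute-force search over all pairs of training observations. This is a minimal illustration, assuming features are given as plain numeric tuples and that the training split has at least two observations; `cold_start_regression` is a hypothetical name and the library's own implementation may differ (for example by using vectorised tensor operations).

```python
import math
from itertools import combinations


def cold_start_regression(train_indices, features):
    """Label the two training observations that are farthest apart (Euclidean).

    Brute-force O(n^2) search; requires at least two training indices.
    """
    best_pair, best_dist = None, -1.0
    for i, j in combinations(train_indices, 2):
        d = math.dist(features[i], features[j])  # Euclidean distance (Python 3.8+)
        if d > best_dist:
            best_pair, best_dist = (i, j), d
    labelled = list(best_pair)
    unlabelled = [k for k in train_indices if k not in best_pair]
    return labelled, unlabelled


# Toy 2-D features; observations 2 and 4 are the farthest apart.
features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 0.0), (-3.0, -4.0)]
labelled, unlabelled = cold_start_regression(range(len(features)), features)
```

The quadratic cost of the pairwise search is acceptable for small benchmark datasets but would need a cheaper approximation for large ones.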
|
|
||
| The following code snippet will return a DataManager corresponding to a cold-start initialisation for the diabetes regression dataset using one of the precomputed splits: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
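| from pyrelational.datasets import DiabetesDataset | ||
| from pyrelational.datasets.benchmark_datamanager import create_regression_cold_start | ||
| dataset = DiabetesDataset() | ||
| train_indices = list(dataset.data_splits[0][0]) | ||
| test_indices = list(dataset.data_splits[0][1]) | ||
| dm = create_regression_cold_start(dataset, train_indices=train_indices, test_indices=test_indices) | ||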
|
|
||
| We welcome contributions of datasets and AL task configurations, provided they are justified by the AL literature or make a convincing case for addition as a benchmark for AL strategies. | ||
|
|
||
| .. rubric:: Footnotes | ||
|
|
||
| .. [#f1] Please see the datasets API reference for a full listing. | ||
| .. [#f2] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) | ||
| .. [#f3] Konyushkova et al., "Learning Active Learning from Data", NeurIPS 2017 (publicly available via https://arxiv.org/abs/1703.03365) | ||
| .. [#f4] https://archive.ics.uci.edu/ml/datasets/diabetes |