Merged
59 commits
4ef7ecb
Ignore experiment logs generated in the examples
paulmorio Apr 20, 2022
c390aff
Initial commit with UCIDatasets module for downloading and initial pr…
paulmorio Apr 20, 2022
a8dff40
SynthClass datasets as in Yang and Loog paper
paulmorio Apr 20, 2022
6c07574
Adding pyreadr and xlrd as dependencies for processing datasets from …
paulmorio Apr 20, 2022
1518c8c
DocStrings on SynthClass datasets
paulmorio Apr 20, 2022
20a43fd
Adding Breast Cancer and Digits datasets
paulmorio Apr 20, 2022
d2ad56a
Fashion MNIST added
paulmorio Apr 20, 2022
2774b80
Fashion MNIST added
paulmorio Apr 20, 2022
754e231
Added UCI Classification datasets
paulmorio Apr 20, 2022
7f981c9
Striatum dataset
paulmorio Apr 20, 2022
bdf8207
GaussianClouds dataset from Konyushkova et al
paulmorio Apr 20, 2022
55ff93f
Checkerboard datasets from Konyushkova et al
paulmorio Apr 20, 2022
4103f58
CreditCard Dataset from Dal Pozzolo et al.
paulmorio Apr 21, 2022
6f9d703
Bringing over regression datasets
paulmorio Apr 26, 2022
bccf4f0
SynthReg2 and Diabetes dataset from Efron et al.
paulmorio Apr 26, 2022
4374971
UCI Regression datasets
paulmorio Apr 26, 2022
b8bd36b
Packaging and work on datamanager generation for benchmarks
paulmorio Apr 26, 2022
e6c5f51
Cold start data manager functions for classification and regression d…
paulmorio Apr 26, 2022
5238390
Initial work on tests for new modules
paulmorio Apr 26, 2022
32f7309
Ignore experiment_logs generated running pytest from project root
paulmorio Apr 26, 2022
9912e1e
Tests for regression datasets
paulmorio Apr 26, 2022
d9bb499
Making UCIDatasets download to tmp on default
paulmorio Apr 26, 2022
7d53aa0
Adding openpyxl as dependency to read excel sheets used as raw data s…
paulmorio Apr 26, 2022
9bd10e0
Tests for UCI regression datasets
paulmorio Apr 26, 2022
3d6683d
SynthClass tests
paulmorio Apr 26, 2022
5e160b1
Moved download default to /tmp/ and adjusted k fold to occur over ful…
paulmorio Apr 26, 2022
8336c20
Adding BreastCancer, Digit and FashionMNIST tests
paulmorio Apr 26, 2022
cd94f0b
UCI dataset tests
paulmorio Apr 26, 2022
c3c8940
Tests for datasets in Konyushkova et al
paulmorio Apr 26, 2022
b2cb389
Tests for datamanager al benchmark generators
paulmorio Apr 27, 2022
270ab5b
Linting on the tests
paulmorio Apr 27, 2022
b7bdac3
Linting on benchmark datamanager
paulmorio Apr 27, 2022
01837e1
Linting on the classification datasets module
paulmorio Apr 27, 2022
12a9233
Linting for dataset downloading modules
paulmorio Apr 27, 2022
5590144
CreditCardDataset added, tested, and linted
paulmorio Apr 27, 2022
3242b3a
Further linting for code quality standards
paulmorio Apr 27, 2022
1297a0d
Linted with updated black
paulmorio Apr 27, 2022
db6ca1e
Removing unused variables in dataset
paulmorio Apr 27, 2022
ea60572
Updating black revision in pre-commit due to issues breaking CI descr…
paulmorio Apr 27, 2022
65a89c6
Adding pyreadr, xlrd, openpyxl as requirements and updating required …
paulmorio Apr 27, 2022
a8dd74a
Commenting tests which require download of data onto specific /tmp/ d…
paulmorio Apr 27, 2022
fffdc29
Tests will download and store data within project instead of /tmp/ fo…
paulmorio Apr 27, 2022
5649d2f
add error print to test hook
Apr 28, 2022
a60b5e4
Removed redundant tensor operation for FashionMNIST and made creation…
paulmorio Apr 28, 2022
6fb8bf0
Merge branch 'datasets' of https://github.com/RelationRx/pyrelational…
paulmorio Apr 28, 2022
29cf397
Unit test for UCI dataset downloader
paulmorio May 2, 2022
24484e0
Docs for classification dataset modules
paulmorio May 2, 2022
4acec58
Docs for regression dataset modules
paulmorio May 2, 2022
9fabf6e
Reformatting with accompanying text on the page
paulmorio May 2, 2022
f6c0513
Deprecating dataset
paulmorio May 22, 2022
5fbe6d4
Removing test for deprecated datasets
paulmorio May 22, 2022
4cfe919
Deprecating dataset
paulmorio May 22, 2022
acd60a4
Updating test for deprecated datasets
paulmorio May 22, 2022
d8f778c
Initial work on documentation for the datasets
paulmorio May 22, 2022
66056b3
Updating tutorial on classification datasets
paulmorio May 22, 2022
47d1648
Updating tutorial on regression datasets
paulmorio May 23, 2022
edde292
Updating tutorial on regression datasets
paulmorio May 23, 2022
110cc48
Updating uci_datasets to address redundant shuffle
paulmorio May 25, 2022
a615c28
Updating README
paulmorio May 25, 2022
4 changes: 4 additions & 0 deletions .github/workflows/tests.yaml
@@ -29,5 +29,9 @@ jobs:
- name: Test with pytest
run: |
python -m pytest --cache-clear --cov=pyrelational tests > pytest-coverage.txt
- name: Print error
if: failure()
run: |
cat pytest-coverage.txt
- name: Comment coverage
uses: coroo/pytest-coverage-commentator@v1.0.2
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,6 +2,9 @@

# Dev files
deprecated/
examples/demo/experiment_logs/
experiment_logs/
test_data/

# Checkpoints
checkpoints/
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -6,7 +6,7 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 21.12b0
rev: 22.3.0
hooks:
- id: black
- repo: https://github.com/PyCQA/flake8
145 changes: 79 additions & 66 deletions README.md
@@ -1,9 +1,8 @@
# PyRelationAL


<p>
<a alt="coverage">
<img src="https://img.shields.io/badge/coverage-93%25-green" /></a>
<img src="https://img.shields.io/badge/coverage-94%25-green" /></a>
<a alt="semver">
<img src="https://img.shields.io/badge/semver-0.1.5-blue" /></a>
<a alt="documentation" href="https://pyrelational.readthedocs.io/en/latest/index.html">
@@ -12,20 +11,27 @@
<img src="https://img.shields.io/badge/pypi-online-yellow" /></a>
</p>

### Quick install
PyRelationAL is an open-source Python library for the rapid and reliable construction of active learning (AL) pipelines and strategies. The toolkit offers a modular design for a flexible workflow that enables active learning with as little change to your models and datasets as possible. The package is primarily aimed at researchers so that they can rapidly reimplement, adapt, and create novel active learning strategies. For more information on how we achieve this, consult the sections below, our comprehensive docs, or our paper. PyRelationAL is principally designed with PyTorch workflows in mind but can easily be extended to work with other ML frameworks.

`pip install pyrelational`
Detailed in the **overview** section below, PyRelationAL offers:

### Organisation of repository
- Data management in AL pipelines (*DataManager*)
- Wrappers for models to be used in AL workflows and strategies (*Model Manager*)
- (Optional) Ensembling and Bayesian inference approximation for quantifying uncertainty from point-estimate models (*Uncertainty estimation*).
- Active learning strategies and templates for making your own! (*Active learning strategy*)
- Benchmark datasets: an API for downloading datasets and AL task configurations based on the literature, for more standardised and pain-free benchmarking.

- `pyrelational` folder contains the source code for the PyRelationAL package. It contains the main sub-packages for active learning strategies, various informativeness measures, and methods for estimating posterior uncertainties.
- `examples` folder contains various example scripts and notebooks detailing how the package can be used
- `tests` folder contains unit tests for pyrelational package
- `docs` folder contains docs and assets for docs
One of our main incentives for making this library is to get more people interested in the research and development of AL. Hence we have made primers, tutorials, and examples available on our website for newcomers and experienced AL practitioners alike. Experienced users can refer to our numerous examples to get started on their AL projects.

### The `PyRelationAL` package
## Quick install

#### Example
```bash
pip install pyrelational
```

## The `PyRelationAL` package

### Example

```python
# Active Learning package
@@ -48,54 +54,63 @@ al_manager.theoretical_performance(test_loader=test_loader)
al_manager.full_active_learning_run(num_annotate=100, test_loader=test_loader)
```

#### Overview
## Overview

![Overview](docs/images/active_learning_loop.png "Overview")

The PyRelationAL package offers a flexible workflow to enable active learning with as little change to the models and datasets as possible. It is partially inspired by Robert (Munro) Monarch's book: "Human-In-The-Loop Machine Learning" and shares some vocabulary from there. It is principally designed with PyTorch in mind, but can be easily extended to work with other libraries.
The `PyRelationAL` package decomposes the active learning workflow into four main components: 1) a **data manager**, 2) a **model**, 3) an **AL strategy** built around an informativeness function, and 4) an **oracle** (see Figure above). Note that the oracle is external to the package.

For a primer on active learning, we refer the reader to Burr Settles's survey [[reference](https://burrsettles.com/pub/settles.activelearning.pdf)]. In his own words
> The key idea behind active learning is that a machine learning algorithm can
achieve greater accuracy with fewer training labels if it is allowed to choose the
data from which it learns. An active learner may pose queries, usually in the form
of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).
Active learning is well-motivated in many modern machine learning problems,
where unlabeled data may be abundant or easily obtained, but labels are difficult,
time-consuming, or expensive to obtain.
The **data manager** (defined in `pyrelational.data.data_manager.GenericDataManager`) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.
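
As a minimal sketch of wiring one up (the keyword arguments below are assumptions for illustration; consult the API reference for the exact constructor signature):

```python
# Hypothetical sketch: wrap a standard PyTorch Dataset for AL.
# Keyword names are assumptions; check the API reference.
from pyrelational.data.data_manager import GenericDataManager

data_manager = GenericDataManager(
    dataset,                        # any torch.utils.data.Dataset
    train_indices=train_indices,    # indices forming the training pool
    test_indices=test_indices,      # held-out evaluation indices
    labelled_indices=seed_indices,  # initially labelled samples
    loader_batch_size=32,           # forwarded to the DataLoaders
)
```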

![Overview](docs/images/active_learning_loop.png "Overview")
The **model** (extending `pyrelational.models.generic_model.GenericModel`) wraps a user-defined ML model (e.g. a PyTorch Module, Flax module, or scikit-learn estimator) and handles instantiation, training, and testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout) where relevant. It also enables using ML models implemented in different ML frameworks (for example, see `examples/demo/model_gaussianprocesses.py` or `examples/demo/scikit_estimator.py`).
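
A schematic of such a wrapper might look as follows (a sketch only: the hook names are assumptions for illustration; see `examples/demo/` for working wrappers):

```python
# Hypothetical sketch: expose a user-defined estimator through the
# GenericModel interface. Method names are assumptions for illustration.
from pyrelational.models.generic_model import GenericModel

class MyModelWrapper(GenericModel):
    def train(self, train_loader, valid_loader=None):
        # fit the wrapped estimator on batches from train_loader
        ...

    def test(self, loader):
        # return a dict of evaluation metrics computed on loader
        ...
```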

The `PyRelationAL` package decomposes the active learning workflow into four main components: 1) a **data manager**, 2) a **model**, 3) an **acquisition strategy** built around informativeness scorer, and 4) an **oracle** (see Figure above). Note that the oracle is external to the package.
The **AL strategy** (extending `pyrelational.strategies.generic_al_strategy.GenericActiveLearningStrategy`) defines an active learning strategy via an *informativeness measure* and a *query selection algorithm*; together these compute the utility of a query, or of a set of queries in batch active mode. We define various classic strategies for classification, regression, and task-agnostic scenarios based on the informativeness measures defined in `pyrelational.informativeness`. The flexible nature of `GenericActiveLearningStrategy` allows strategies ranging from simple serial uncertainty sampling to complex agents that combine several informativeness measures with state- or learning-based query selection algorithms and query-batch-building bandits that handle uncertainty from noisy oracles.
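
For instance, a serial uncertainty-sampling strategy could be sketched as below (the attribute and method names are assumptions for illustration, not the library's confirmed interface; see the strategies subpackage for the concrete hooks):

```python
# Hypothetical sketch: rank unlabelled samples by predictive entropy.
import torch
from pyrelational.strategies.generic_al_strategy import GenericActiveLearningStrategy

class EntropyStrategy(GenericActiveLearningStrategy):
    def active_learning_step(self, num_annotate):
        # Assumed hooks: a predict method over the unlabelled pool and
        # an index list mapping pool positions back to the dataset.
        probs = self.model.predict(self.u_loader).softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        ranked = torch.argsort(entropy, descending=True)
        return [self.u_indices[int(i)] for i in ranked[:num_annotate]]
```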

The data manager (defined in `pyrelational.data.data_manager.GenericDataManager`) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.
In addition to the main modules above we offer tools for **uncertainty estimation**. In recognition of the growing use of deep learning models, we offer a suite of Bayesian inference approximation methods, such as MC-Dropout and ensembles of models, for quantifying the uncertainty coming from the functional model (these may also be used to define query-by-committee and query-by-disagreement strategies).
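
To illustrate the idea behind MC-Dropout independently of the library's wrappers, a plain-PyTorch sketch:

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Keep dropout active at inference time and average several
    stochastic forward passes to approximate the predictive distribution."""
    model.train()  # keeps dropout layers stochastic at inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)  # predictive mean and spread
```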

The model (subclassed from `pyrelational.models.generic_model.GenericModel`) wraps a user defined ML model (e.g. PyTorch Module, Pytorch Lightning Module, or scikit-learn estimator) and handles instantiation, training, testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout). It also enables using ML models that directly estimate their uncertainties such as Gaussian Processes (see `examples/demo/model_gaussianprocesses.py`).
Finally, to help test and benchmark strategies we offer **benchmark datasets** and **AL task configurations**. We provide an API to a selection of datasets used previously in the AL literature, each offered in several AL task configurations, such as cold and warm initialisations, for pain-free benchmarking. For more details see our paper and documentation.

The active learning strategy (which subclass `pyrelational.strategies.generic_al_strategy.GenericActiveLearningStrategy`) revolves around an informativeness score that serve as the basis for the selection of the query sent to the oracle for labelling. We define various strategies for classification, regression, and task-agnostic scenarios based on informativeness scorer defined in `pyrelational.informativeness`.
In the next section we briefly outline currently available strategies, informativeness measures, uncertainty estimation methods and some planned modules.

## Prerequisites and setup
### List of included strategies and uncertainty estimation methods (constantly growing!)

For those just using the package, installation only requires standard ML packages and PyTorch. Starting with a new virtual environment (miniconda environment recommended), install standard learning packages and numerical tools.
#### Uncertainty Estimation

```bash
pip install -r requirements.txt
```
- MCDropout
- Ensemble of models (a.k.a. committee)
- DropConnect (coming soon)
- SWAG (coming soon)
- MultiSWAG (coming soon)

If you wish to contribute to the code, run `pre-commit install` after the above step.
#### Informativeness measures included in the library

## Building the docs
##### Regression (N.B. PyRelationAL currently only supports single scalar regression tasks)

Make sure you have `sphinx` and `sphinx-rtd-theme` packages installed (`pip install sphinx sphinx_rtd_theme` will install this).
- Greedy score
- Least confidence score
- Expected improvement score
- Thompson sampling score
- Upper confidence bound (UCB) score
- BALD

To generate the docs, `cd` into the `docs/` directory and run `make html`. This will generate the docs
at `docs/_build/html/index.html`.
##### Classification (N.B. PyRelationAL does not support multi-label classification at the moment)

- Least confidence
- Margin confidence
- Entropy based confidence
- Ratio based confidence
- BALD
- Thompson Sampling (coming soon)
- BatchBALD (coming soon)


##### Model agnostic and diversity sampling based approaches

- Representative sampling
- Diversity sampling
- Random acquisition
- BADGE

## Quickstart & examples
The `examples/` folder contains multiple scripts and notebooks demonstrating how to use PyRelationAL effectively.

@@ -121,47 +142,39 @@ The diverse example scripts and notebooks aim to showcase how to use pyrelational
- `gpytorch_integration.py`
- `model_badge.py`

- examples custom acquisition strategy
- examples on how to create a custom acquisition strategy
- `model_badge.py`
- `lightning_mixed_regression.py`

- examples custom model
- examples using different ML frameworks
- `model_gaussianprocesses.py`
- `scikit_estimator.py`

## Uncertainty Estimation

- MCDropout
- Ensemble of models (a.k.a. committee)
- DropConnect (coming soon)
- SWAG (coming soon)
- MultiSWAG (coming soon)
## Contributing to PyRelationAL

## Informativeness scorer included in the library
We welcome contributions to PyRelationAL; please see and adhere to the `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md` guidelines.

### Regression (N.B. PyRelationAL currently only supports single scalar regression tasks)
### Prerequisites and setup

- Greedy
- Least confidence
- Expected improvement
- Thompson Sampling
- Upper confidence bound (UCB)
- BALD
- BatchBALD (coming soon)
For those just using the package, installation only requires standard ML packages and PyTorch. More advanced users or those wishing to contribute should start with a new virtual environment (miniconda environment recommended) and install standard learning packages and numerical tools.

### Classification (N.B. PyRelationAL does not support multi-label classification at the moment)
```bash
pip install -r requirements.txt
```

- Least confidence
- Margin confidence
- Entropy based confidence
- Ratio based confidence
- BALD
- Thompson Sampling (coming soon)
- BatchBALD (coming soon)
If you wish to contribute to the code, run `pre-commit install` after the above step.

### Organisation of repository

### Model agnostic and diversity sampling based approaches
- `pyrelational` folder contains the source code for the PyRelationAL package. It contains the main sub-packages for active learning strategies, various informativeness measures, and methods for estimating posterior uncertainties.
- `examples` folder contains various example scripts and notebooks detailing how the package can be used to construct novel strategies, work with different ML frameworks, and use your own data
- `tests` folder contains unit tests for pyrelational package
- `docs` folder contains documentation and assets for docs

- Representative sampling
- Diversity sampling
- Random acquisition
- BADGE
### Building the docs

Make sure you have the `sphinx` and `sphinx-rtd-theme` packages installed (`pip install sphinx sphinx_rtd_theme` will install them).

To generate the docs, `cd` into the `docs/` directory and run `make html`. This will generate the docs
at `docs/_build/html/index.html`.
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -21,13 +21,15 @@ rapidly implementing active learning pipelines from data management, model development
notes/using_your_own_data
notes/using_the_model_api
notes/using_your_own_strategy
notes/benchmark_datasets

.. toctree::
:glob:
:maxdepth: 2
:caption: Package modules

reference/data.rst
reference/datasets.rst
reference/models.rst
reference/informativeness.rst
reference/strategies.rst
75 changes: 75 additions & 0 deletions docs/source/notes/benchmark_datasets.rst
@@ -0,0 +1,75 @@
.. _benchmark_datasets:

Benchmark datasets and AL task configurations
=============================================
A fundamental assumption in evaluating active learning strategies is that there exists a labelled subset of the training dataset that allows a model to perform as well (on the holdout test set) as training on the entire training set. In evaluating an AL strategy we are therefore interested in how efficiently it finds this subset and maximises performance.

To help users benchmark their strategies and active learning pipelines we have collected a range of datasets that have been used for benchmarking strategies in the AL literature [#f1]_ . We provide classification and regression datasets from a range of real-world applications. Additionally, we provide utilities to create **cold** and **warm** start label initialisations corresponding to different active learning tasks, so you can also evaluate your strategy in these scenarios. More on these in the respective sections below.

This short tutorial covers the `datasets` subpackage, whose classes download and process raw data into PyTorch Datasets ready for use with our DataManager classes. These extend completely standard PyTorch Dataset objects and can be used for normal ML experimentation as well. Each dataset takes additional parameters describing how it is split for cross-validation experiments; the splits are seeded for easier reproduction.

We hope that this resource makes horizontal analysis of AL strategies across a range of datasets and
AL tasks easier. Better yet, let's hope it garners interest in establishing a set of challenging active learning benchmarks and tasks that can set a standard for the AL field.

Example usage: classification dataset
-------------------------------------

In this example we will look at the Wisconsin Breast Cancer (diagnostic) dataset [#f2]_ . It can be downloaded and processed with

.. code-block:: python

from pyrelational.datasets import BreastCancerDataset
dataset = BreastCancerDataset(n_splits=5)

The `n_splits` argument specifies the number of train-test splits to compute. For classification datasets the splits are stratified by class. The `dataset` variable behaves like a regular PyTorch Dataset and is compatible with PyTorch's excellent DataLoaders.
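
For instance, assuming each item yields a (features, target) pair (a sketch; the exact item structure may vary by dataset):

.. code-block:: python

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
x, y = next(iter(loader))  # one mini-batch of features and targets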

The `create_warm_start()` and `create_classification_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` generate PyRelationAL DataManager objects corresponding to the following AL tasks, inspired by Konyushkova et al. [#f3]_ .

- **Cold-start classification**: one observation for each class represented in the training set is labelled, and the rest is unlabelled.
- **Warm-start classification**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled.

The following code snippet will return a DataManager corresponding to a cold-start initialisation for the breast cancer classification dataset using one of the precomputed splits:

.. code-block:: python

from pyrelational.datasets import BreastCancerDataset
from pyrelational.datasets.benchmark_datamanager import create_classification_cold_start

dataset = BreastCancerDataset()
train_indices = list(dataset.data_splits[0][0])
test_indices = list(dataset.data_splits[0][1])
dm = create_classification_cold_start(dataset, train_indices=train_indices, test_indices=test_indices)
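
A warm-start DataManager can be created analogously for either task type; the snippet below assumes `create_warm_start` shares the same signature as the cold-start helpers (a sketch, consult the API reference):

.. code-block:: python

from pyrelational.datasets.benchmark_datamanager import create_warm_start

dm_warm = create_warm_start(dataset, train_indices=train_indices, test_indices=test_indices)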


Example usage: regression dataset
---------------------------------

This example mirrors the classification case, adjusted for the regression ML task. We will use the UCI Diabetes dataset [#f4]_ . It can be downloaded and processed with

.. code-block:: python

from pyrelational.datasets import DiabetesDataset
dataset = DiabetesDataset(n_splits=5)

As before, the `n_splits` argument specifies the number of train-test splits to compute for the cross-validation setup. For regression these are random splits, not stratified as in the classification case.

The `create_warm_start()` and `create_regression_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` generate PyRelationAL DataManager objects corresponding to the following AL tasks, inspired by Konyushkova et al. [#f3]_ .

- **Cold-start regression**: the two observations with the highest Euclidean pairwise distance in the training set are labelled, the rest is unlabelled.
- **Warm-start regression**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled.

The following code snippet will return a DataManager corresponding to a cold-start initialisation for the diabetes regression dataset using one of the precomputed splits:

.. code-block:: python

from pyrelational.datasets import DiabetesDataset
from pyrelational.datasets.benchmark_datamanager import create_regression_cold_start

dataset = DiabetesDataset()
train_indices = list(dataset.data_splits[0][0])
test_indices = list(dataset.data_splits[0][1])
dm = create_regression_cold_start(dataset, train_indices=train_indices, test_indices=test_indices)

We welcome contributions adding datasets and AL task configurations, provided they are justified by the AL literature or make a convincing case for inclusion as benchmarks for AL strategies.

.. rubric:: Footnotes

.. [#f1] Please see the datasets API reference for a full listing
.. [#f2] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
.. [#f3] Konyushkova et al. "Learning Active Learning from Data". NeurIPS 2017 (publicly available via https://arxiv.org/abs/1703.03365)
.. [#f4] https://archive.ics.uci.edu/ml/datasets/diabetes