RelationRx · paulmorio · May 26, 2022 · Apr 20, 2022
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
@@ -29,5 +29,9 @@ jobs:
       - name: Test with pytest
         run: |
           python -m pytest --cache-clear --cov=pyrelational tests > pytest-coverage.txt
+      - name: Print error
+        if: failure()
+        run: |
+          cat pytest-coverage.txt
       - name: Comment coverage
         uses: coroo/pytest-coverage-commentator@v1.0.2
diff --git a/.gitignore b/.gitignore
@@ -2,6 +2,9 @@
 
 # Dev files
 deprecated/
+examples/demo/experiment_logs/
+experiment_logs/
+test_data/
 
 # Checkpoints
 checkpoints/

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -6,7 +6,7 @@ repos:
     -   id: end-of-file-fixer
     -   id: trailing-whitespace
 -   repo: https://github.com/psf/black
-    rev: 21.12b0
+    rev: 22.3.0
     hooks:
     -   id: black
 -   repo: https://github.com/PyCQA/flake8

diff --git a/README.md b/README.md
@@ -1,9 +1,8 @@
 # PyRelationAL
 
-
 <p>
     <a alt="coverage">
-        <img src="https://img.shields.io/badge/coverage-93%25-green" /></a>
+        <img src="https://img.shields.io/badge/coverage-94%25-green" /></a>
     <a alt="semver">
         <img src="https://img.shields.io/badge/semver-0.1.5-blue" /></a>
     <a alt="documentation" href="https://pyrelational.readthedocs.io/en/latest/index.html">
@@ -12,20 +11,27 @@
         <img src="https://img.shields.io/badge/pypi-online-yellow" /></a>
 </p>
 
-### Quick install
+PyRelationAL is an open source Python library for the rapid and reliable construction of active learning (AL) pipelines and strategies. The toolkit offers a modular design for a flexible workflow that enables active learning with as little change to your models and datasets as possible. The package is primarily aimed at researchers so that they can rapidly reimplement, adapt, and create novel active learning strategies. For more information on how we achieve this you can consult the sections below, our comprehensive docs, or our paper. PyRelationAL is principally designed with PyTorch workflows in mind but can easily be extended to work with other ML frameworks.
 
-`pip install pyrelational`
+Detailed in the **overview** section below, PyRelationAL offers:
 
-### Organisation of repository
+- Data management in AL pipelines (*DataManager*)
+- Wrappers for models to be used in AL workflows and strategies (*Model Manager*)
+- (Optional) Ensembling and Bayesian inference approximation to quantify uncertainty from point-estimate models (*Uncertainty estimation*).
+- Active learning strategies and templates for making your own! (*Active learning strategy*)
+- Benchmark datasets: an API for downloading datasets and AL task configurations based on literature for more standardised and pain-free benchmarking.
 
-- `pyrelational` folder contains the source code for the PyRelationAL package. It contains the main sub-packages for active learning strategies, various informativeness measures, and methods for estimating posterior uncertainties.
-- `examples` folder contains various example scripts and notebooks detailing how the package can be used
-- `tests` folder contains unit tests for pyrelational package
-- `docs` folder contains docs and assets for docs
+One of our main incentives for making this library is to get more people interested in research and development of AL. Hence we have made primers, tutorials, and examples available on our website for newcomers and experienced AL practitioners alike. Experienced users can refer to our numerous examples to get started on their AL projects.
 
-### The `PyRelationAL` package
+## Quick install
 
-#### Example
+```bash
+pip install pyrelational
+```
+
+## The `PyRelationAL` package
+
+### Example
 
 ```python
 # Active Learning package
@@ -48,48 +54,63 @@ al_manager.theoretical_performance(test_loader=test_loader)
 al_manager.full_active_learning_run(num_annotate=100, test_loader=test_loader)
 ```
 
-#### Overview
+## Overview
 
+![Overview](docs/images/active_learning_loop.png "Overview")
 
-The PyRelationAL package offers a flexible workflow to enable active learning with as little change to the models and datasets as possible. It is partially inspired by Robert (Munro) Monarch's book: "Human-In-The-Loop Machine Learning" and shares some vocabulary from there. It is principally designed with PyTorch in mind, but can be easily extended to work with other libraries.
+The `PyRelationAL` package decomposes the active learning workflow into four main components: 1) a **data manager**, 2) a **model**, 3) an **AL strategy** built around an informativeness function, and 4) an **oracle** (see Figure above). Note that the oracle is external to the package.
 
-For a primer on active learning, we refer the reader to Burr Settles's survey [[reference](https://burrsettles.com/pub/settles.activelearning.pdf)]. In his own words
-> The key idea behind active learning is that a machine learning algorithm can
-achieve greater accuracy with fewer training labels if it is allowed to choose the
-data from which it learns. An active learner may pose queries, usually in the form
-of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).
-Active learning is well-motivated in many modern machine learning problems,
-where unlabeled data may be abundant or easily obtained, but labels are difficult,
-time-consuming, or expensive to obtain.
+The **data manager** (defined in `pyrelational.data.data_manager.GenericDataManager`) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.
 
-![Overview](docs/images/active_learning_loop.png "Overview")
+The **model** (extending `pyrelational.models.generic_model.GenericModel`) wraps a user defined ML model (e.g. PyTorch Module, Flax module, or scikit-learn estimator) and handles instantiation, training, testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout) if relevant. It also enables using ML models implemented using different ML frameworks (for example see `examples/demo/model_gaussianprocesses.py` or `examples/demo/scikit_estimator.py`).
 
-The `PyRelationAL` package decomposes the active learning workflow into four main components: 1) a **data manager**, 2) a **model**, 3) an **acquisition strategy** built around informativeness scorer, and 4) an **oracle** (see Figure above). Note that the oracle is external to the package.
+The **AL strategy** (extending `pyrelational.strategies.generic_al_strategy.GenericActiveLearningStrategy`) defines an active learning strategy via an *informativeness measure* and a *query selection algorithm*. Together they compute the utility of a single query, or of a set of queries in a batch-mode strategy. We define various classic strategies for classification, regression, and task-agnostic scenarios based on the informativeness measures defined in `pyrelational.informativeness`. The flexible nature of the `GenericActiveLearningStrategy` allows for the construction of strategies ranging from simple serial uncertainty sampling approaches to complex agents that leverage several informativeness measures, state- and learning-based query selection algorithms, and bandits that build query batches under uncertainty from noisy oracles.
 
-The data manager (defined in `pyrelational.data.data_manager.GenericDataManager`) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.
+In addition to the main modules above we offer tools for **uncertainty estimation**. In recognition of the growing use of deep learning models we offer a suite of methods for Bayesian inference approximation to quantify uncertainty coming from the functional model such as MCDropout and ensembles of models (which may be used to also define query by committee and query by disagreement strategies).
 
-The model (subclassed from `pyrelational.models.generic_model.GenericModel`) wraps a user defined ML model (e.g. PyTorch Module, Pytorch Lightning Module, or scikit-learn estimator) and handles instantiation, training, testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout). It also enables using ML models that directly estimate their uncertainties such as Gaussian Processes (see `examples/demo/model_gaussianprocesses.py`).
+Finally, to help test and benchmark strategies, we offer **Benchmark datasets** and **AL task configurations**. We offer an API to a selection of datasets used previously in AL literature and offer each in several AL task configurations, such as cold and warm initialisations, for pain-free benchmarking. For more details see our paper and documentation.
 
-The active learning strategy (which subclass `pyrelational.strategies.generic_al_strategy.GenericActiveLearningStrategy`) revolves around an informativeness score that serve as the basis for the selection of the query sent to the oracle for labelling. We define various strategies for classification, regression, and task-agnostic scenarios based on informativeness scorer defined in `pyrelational.informativeness`.
+In the next section we briefly outline currently available strategies, informativeness measures, uncertainty estimation methods and some planned modules.
 
-## Prerequisites and setup
+### List of included strategies and uncertainty estimation methods (constantly growing!)
 
-For those just using the package, installation only requires standard ML packages and PyTorch. Starting with a new virtual environment (miniconda environment recommended), install standard learning packages and numerical tools.
+#### Uncertainty Estimation
 
-```bash
-pip install -r requirements.txt
-```
+- MCDropout
+- Ensemble of models (a.k.a. committee)
+- DropConnect (coming soon)
+- SWAG (coming soon)
+- MultiSWAG (coming soon)
 
-If you wish to contribute to the code, run `pre-commit install` after the above step.
+#### Informativeness measures included in the library
 
-## Building the docs
+##### Regression (N.B. PyRelationAL currently only supports single scalar regression tasks)
 
-Make sure you have `sphinx` and `sphinx-rtd-theme` packages installed (`pip install sphinx sphinx_rtd_theme` will install this).
+- Greedy score
+- Least confidence score
+- Expected improvement score
+- Thompson sampling score
+- Upper confidence bound (UCB) score
+- BALD
 
-To generate the docs, `cd` into the `docs/` directory and run `make html`. This will generate the docs
-at `docs/_build/html/index.html`.
+##### Classification (N.B. PyRelationAL does not support multi-label classification at the moment)
+
+- Least confidence
+- Margin confidence
+- Entropy based confidence
+- Ratio based confidence
+- BALD
+- Thompson Sampling (coming soon)
+- BatchBALD (coming soon)
 
 
+##### Model agnostic and diversity sampling based approaches
+
+- Representative sampling
+- Diversity sampling
+- Random acquisition
+- BADGE
+
 ## Quickstart & examples
 The `examples/` folder contains multiple scripts and notebooks demonstrating how to use PyRelationAL effectively.
 
@@ -121,47 +142,39 @@ The diverse examples scripts and notebooks aim to showcase how to use pyrelation
   - `gpytorch_integration.py`
   - `model_badge.py`
 
-- examples custom acquisition strategy
+- examples on how to create a custom acquisition strategy
   - `model_badge.py`
   - `lightning_mixed_regression.py`
 
-- examples custom model
+- examples using different ML frameworks
   - `model_gaussianprocesses.py`
+  - `scikit_estimator.py`
 
-## Uncertainty Estimation
 
-- MCDropout
-- Ensemble of models (a.k.a. commitee)
-- DropConnect (coming soon)
-- SWAG (coming soon)
-- MultiSWAG (coming soon)
+## Contributing to PyRelationAL
 
-## Informativeness scorer included in the library
+We welcome contributions to PyRelationAL. Please see and adhere to the `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md` guidelines.
 
-### Regression (N.B. PyRelationAL currently only supports single scalar regression tasks)
+### Prerequisites and setup
 
-- Greedy
-- Least confidence
-- Expected improvement
-- Thompson Sampling
-- Upper confidence bound (UCB)
-- BALD
-- BatchBALD (coming soon)
+For those just using the package, installation only requires standard ML packages and PyTorch. More advanced users or those wishing to contribute should start with a new virtual environment (miniconda environment recommended) and install standard learning packages and numerical tools.
 
-### Classification (N.B. PyRelationAL does not support multi-label classification at the moment)
+```bash
+pip install -r requirements.txt
+```
 
-- Least confidence
-- Margin confidence
-- Entropy based confidence
-- Ratio based confidence
-- BALD
-- Thompson Sampling (coming soon)
-- BatchBALD (coming soon)
+If you wish to contribute to the code, run `pre-commit install` after the above step.
 
+### Organisation of repository
 
-### Model agnostic and diversity sampling based approaches
+- `pyrelational` folder contains the source code for the PyRelationAL package. It contains the main sub-packages for active learning strategies, various informativeness measures, and methods for estimating posterior uncertainties.
+- `examples` folder contains various example scripts and notebooks detailing how the package can be used to construct novel strategies, work with different ML frameworks, and use your own data
+- `tests` folder contains unit tests for pyrelational package
+- `docs` folder contains documentation and assets for docs
 
-- Representative sampling
-- Diversity sampling
-- Random acquisition
-- BADGE
+### Building the docs
+
+Make sure you have the `sphinx` and `sphinx-rtd-theme` packages installed (`pip install sphinx sphinx_rtd_theme` will install them).
+
+To generate the docs, `cd` into the `docs/` directory and run `make html`. This will generate the docs
+at `docs/_build/html/index.html`.
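The classification informativeness measures listed in the README diff above (least confidence, margin confidence, entropy) can be illustrated with a minimal, library-free sketch. The function names and the `pool` values below are hypothetical and chosen for illustration only; the implementations in `pyrelational.informativeness` may use different signatures, tensor inputs, or normalisations.

```python
import math


def least_confidence(probs):
    """1 - max class probability: higher means the model is less sure."""
    return 1.0 - max(probs)


def margin_confidence(probs):
    """Gap between the two most likely classes: a smaller gap is more informative."""
    top, runner_up = sorted(probs, reverse=True)[:2]
    return top - runner_up


def entropy(probs):
    """Shannon entropy (in nats) of the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank_by(score, pool_probs, descending=True):
    """Return pool indices ordered from most to least informative under `score`."""
    return sorted(
        range(len(pool_probs)),
        key=lambda i: score(pool_probs[i]),
        reverse=descending,
    )


# Hypothetical predictive distributions for three unlabelled samples.
pool = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]

most_uncertain = rank_by(least_confidence, pool)
smallest_margin = rank_by(margin_confidence, pool, descending=False)
highest_entropy = rank_by(entropy, pool)
```

On this toy pool all three measures agree that the second sample is the most informative one to send to the oracle; on real data they can rank samples differently, which is why a strategy is defined by its choice of measure.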
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -21,13 +21,15 @@ rapidly implementing active learning pipelines from data management, model devel
    notes/using_your_own_data
    notes/using_the_model_api
    notes/using_your_own_strategy
+   notes/benchmark_datasets
 
 .. toctree::
    :glob:
    :maxdepth: 2
    :caption: Package modules
 
    reference/data.rst
+   reference/datasets.rst
    reference/models.rst
    reference/informativeness.rst
    reference/strategies.rst

diff --git a/docs/source/notes/benchmark_datasets.rst b/docs/source/notes/benchmark_datasets.rst
@@ -0,0 +1,75 @@
+.. _benchmark_datasets:
+
+Benchmark datasets and AL task configurations
+=============================================
+A fundamental assumption in evaluating active learning strategies is that there exists a labelled subset of a training dataset that allows a model to perform as well (on the holdout test set) as using the entire training set. In evaluating an AL strategy we are interested in finding this subset efficiently, and maximising performance in an efficient manner.
+
+To help users benchmark their strategies and active learning pipelines we have collected a range of datasets that have been used for benchmarking strategies in AL literature [#f1]_ . We provide classification and regression type datasets from a range of real world applications. Additionally we provide utilities to help create **cold** and **warm** start label initialisations corresponding to different active learning tasks to also help evaluate your strategy in these scenarios. More on these in the respective sections below.
+
+This short tutorial covers the `datasets` subpackage, whose classes download and process raw data into PyTorch Datasets that are ready for use with our DataManager classes. These extend completely standard PyTorch Dataset objects and can be used for normal ML experimentation as well. Each dataset has additional parameters that describe how it is split for cross-validation experiments; the splits are seeded for easier reproduction.
+
+We hope that this resource helps make horizontal analysis of AL strategies across a range of datasets and
+AL tasks easier. Better yet, we hope it will garner interest in establishing a set of challenging active learning benchmarks and tasks that can set a standard for the AL field.
+
+Example usage: classification dataset
+-------------------------------------
+
+In this example we will look at the Wisconsin Breast Cancer (diagnostic) dataset [#f2]_ . It can be downloaded and processed with
+
+.. code-block:: python
+
+    from pyrelational.datasets import BreastCancerDataset
+    dataset = BreastCancerDataset(n_splits=5)
+
+Here the `n_splits` argument specifies the number of train-test splits that should be computed. For classification datasets the splits will be stratified by class. The `dataset` variable will behave like a regular PyTorch Dataset and is compatible with PyTorch DataLoaders.
+
+The `create_warm_start()` and `create_classification_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` will generate PyRelationAL DataManager objects corresponding to the following AL tasks inspired by Konyushkova et al. [#f3]_ .
+
+- **Cold-start classification**: one observation for each class represented in the training set is labelled, the rest are unlabelled.
+- **Warm-start classification**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled.
+
+The following code snippet will return a DataManager corresponding to a cold-start initialisation for the breast cancer classification dataset using one of the precomputed splits:
+
+.. code-block:: python
+
+    from pyrelational.datasets import BreastCancerDataset
+    from pyrelational.datasets.benchmark_datamanager import create_classification_cold_start
+    dataset = BreastCancerDataset()
+    train_indices = list(dataset.data_splits[0][0])
+    test_indices = list(dataset.data_splits[0][1])
+    dm = create_classification_cold_start(dataset, train_indices=train_indices, test_indices=test_indices)
+
+Example usage: regression dataset
+---------------------------------
+
+This example mirrors the classification case, adjusted for a regression task. We will use the UCI Diabetes dataset [#f4]_ . This can be downloaded and processed with
+
+.. code-block:: python
+
+    from pyrelational.datasets import DiabetesDataset
+    dataset = DiabetesDataset(n_splits=5)
+
+As before the `n_splits` argument specifies the number of train-test splits that should be computed for the cross-validation setup. For regression these will be random splits, not stratified as in the classification case.
+
+The `create_warm_start()` and `create_regression_cold_start()` functions in `pyrelational.datasets.benchmark_datamanager` will generate PyRelationAL DataManager objects corresponding to the following AL tasks inspired by Konyushkova et al. [#f3]_ .
+
+- **Cold-start regression**: the two observations with the highest pairwise Euclidean distance in the training set are labelled, the rest are unlabelled.
+- **Warm-start regression**: a randomly sampled 10 percent of the training set is labelled, the rest is unlabelled.
+
+The following code snippet will return a DataManager corresponding to a cold-start initialisation for the diabetes regression dataset using one of the precomputed splits:
+
+.. code-block:: python
+
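+    from pyrelational.datasets import DiabetesDataset
+    from pyrelational.datasets.benchmark_datamanager import create_regression_cold_start
+    dataset = DiabetesDataset()
+    train_indices = list(dataset.data_splits[0][0])
+    test_indices = list(dataset.data_splits[0][1])
+    dm = create_regression_cold_start(dataset, train_indices=train_indices, test_indices=test_indices)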
+
+We welcome contributions of datasets and AL task configurations, provided they are justified by the AL literature or make a convincing case for inclusion as a benchmark for AL strategies.
+
+.. rubric:: Footnotes
+
+.. [#f1] Please see the datasets API reference for a full listing
+.. [#f2] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
+.. [#f3] Learning Active Learning from Data by Konyushkova et al. NeurIPS 2017 (publicly available via https://arxiv.org/abs/1703.03365)
+.. [#f4] https://archive.ics.uci.edu/ml/datasets/diabetes
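The cold-start and warm-start label initialisations described in `benchmark_datasets.rst` can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code behind `create_classification_cold_start()`, `create_regression_cold_start()`, or `create_warm_start()`: the function names, the `seed` and `fraction` parameters, and the random choice of the per-class sample are all hypothetical.

```python
import math
import random
from itertools import combinations


def cold_start_classification(labels, seed=0):
    """Label one observation per class; every other index stays unlabelled."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Assumption: the labelled sample of each class is drawn at random.
    return sorted(rng.choice(indices) for indices in by_class.values())


def cold_start_regression(features):
    """Label the two observations with the largest pairwise Euclidean distance."""
    i, j = max(
        combinations(range(len(features)), 2),
        key=lambda pair: math.dist(features[pair[0]], features[pair[1]]),
    )
    return [i, j]


def warm_start(n_samples, fraction=0.1, seed=0):
    """Label a randomly sampled fraction (10 percent by default) of the training set."""
    rng = random.Random(seed)
    n_labelled = max(1, round(fraction * n_samples))
    return sorted(rng.sample(range(n_samples), n_labelled))


# Toy data standing in for a training split.
toy_labels = [0, 1, 0, 1, 1, 0]
toy_features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]

cold_cls = cold_start_classification(toy_labels)
cold_reg = cold_start_regression(toy_features)
warm = warm_start(n_samples=50)
```

The brute-force search in `cold_start_regression` is quadratic in the number of training samples, which is fine for small benchmark datasets but would need a different approach at scale.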