This repository contains the code to produce the benchmark, which has two main components.

### Requirements
- Python 3.6 or above
- The packages listed in `requirements.txt`. You can install them via `pip install -r requirements.txt`. Package `torch_scatter` may require a [manual installation](https://github.com/rusty1s/pytorch_scatter#installation)
- Submodules are added to the path:
```sh
export PYTHONPATH="$PYTHONPATH:$(pwd)/external/DomainBed/"
export PYTHONPATH="$PYTHONPATH:$(pwd)/external/wilds/"
```
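
To verify that the submodules are importable, you can run a quick check (a minimal sketch, assuming the stock package layouts of DomainBed and WILDS):
```sh
# should print "ok" if both submodules are on the path
python -c "import domainbed, wilds; print('ok')"
```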
### Quantifying distribution shifts

The quantification process consists of three main steps:
(1) training the environment classifiers,
(2) extracting features with the trained classifiers, and
(3) measuring the shifts with the extracted features.
The module `ood_bench.scripts.main` will handle the whole process for you.
For example, to quantify the distribution shift between the training environments (indexed by 0 and 1) and the test environment (indexed by 2) of [Colored MNIST](https://github.com/facebookresearch/InvariantRiskMinimization/blob/fc185d0f828a98f57030ba3647efc7394d1be95a/code/colored_mnist/main.py#L34) with 16 trials, you can simply run:
```sh
python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    ...
```

These two optional arguments are also useful:
- `--parallel`: utilize multiple GPUs to conduct the trials in parallel. The maximum number of parallel trials is the number of visible GPUs, which can be controlled by setting `CUDA_VISIBLE_DEVICES`.
- `--calibrate`: calibrate the thresholds `eps_div` and `eps_cor` so that the estimated diversity and correlation shifts are guaranteed to stay close to 0 under the i.i.d. condition.
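
For example, a calibrated run spread across four GPUs could look like the following sketch (the dataset-specific arguments elided above still apply):
```sh
# 16 trials, at most 4 running in parallel (one per visible GPU), with calibration
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --parallel \
    --calibrate \
    ...
```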

### Results
The following results are produced by the scripts under [`ood_bench/examples`](ood_bench/examples), all with automatic calibration enabled.

| Dataset | Diversity shift | Correlation shift |
| ----------------- | ----------------- | ----------------- |
| PACS | 0.6715 ± 0.0392* | 0.0338 ± 0.0156* |
| Office-Home | 0.0657 ± 0.0147* | 0.0699 ± 0.0280* |
| Terra Incognita | 0.9846 ± 0.0935* | 0.0002 ± 0.0003* |
| DomainNet | 0.3740 ± 0.0343* | 0.1061 ± 0.0181* |
| WILDS-Camelyon | 0.9632 ± 0.1907 | 0.0000 ± 0.0000 |
| Colored MNIST | 0.0013 ± 0.0006 | 0.5468 ± 0.0278 |
| CelebA | 0.0031 ± 0.0017 | 0.1868 ± 0.0530 |
| NICO | 0.0176 ± 0.0158 | 0.1968 ± 0.0888 |
| ImageNet-A † | 0.0435 ± 0.0123 | 0.0222 ± 0.0192 |
| ImageNet-R † | 0.1024 ± 0.0188 | 0.1180 ± 0.0311 |
| ImageNet-V2 † | 0.0079 ± 0.0017 | 0.2362 ± 0.0607 |

<small>\* [averaged](https://github.com/m-Just/OoD-Bench/blob/2140093fee982b19f122de2f198ec5831442daad/ood_bench/scripts/summarize.py#L37) over all leave-one-domain-out splits&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;† with respect to the original ImageNet</small>

**Note:** the results shown above differ somewhat from those reported in our paper, mainly because we reworked the original implementation to ease public use and to improve quantification stability.
One of the main improvements is the use of calibration.
Previously, the same empirically sound thresholds were used across all the datasets studied in our paper, which may not hold for other datasets.

### Extending OoD-Bench

- **New datasets** must first be added to `external/DomainBed/domainbed/datasets.py` as a subclass of `MultipleDomainDataset`, for example:
```python
class MyDataset(MultipleDomainDataset):
    ENVIRONMENTS = ['env0', 'env1']  # at least two environments

    def __init__(self, root):  # exact signature may vary across datasets
        super().__init__()
        # build one sub-dataset per environment, in the same order as
        # ENVIRONMENTS, following the DomainBed conventions
        self.datasets = [...]             # required
        self.input_shape = (3, 224, 224)  # required
        self.num_classes = 2              # required
```
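
Once registered, the new dataset should be quantifiable like the built-in ones. The `--dataset` flag below is a hypothetical name for the dataset-selection argument; check `ood_bench/scripts/main.py` for the actual interface:
```sh
# hypothetical invocation: select the new dataset by its class name
python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --dataset MyDataset \
    ...
```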

- **New network backbones** must first be added to `ood_bench/networks.py` as a subclass of `Backbone`, for example:
```python
class MyBackbone(Backbone):
    def __init__(self, hdim, pretrained_model_path=None):
        # initialize the base class and define the layers mapping inputs to
        # hdim-dimensional features; load weights from pretrained_model_path
        # if given
        ...
```
