Supported datasets

There are four datasets in the SSB: three fine-grained datasets (CUB-200-2011, Stanford Cars, FGVC-Aircraft); and ImageNet. Each dataset comes with pre-defined Known classes and Unknown classes, where Unknown classes are stratified into 'Easy' and 'Hard' based on their semantic similarity to the Known classes.

For the fine-grained datasets, the original classes are split into Known and Unknown subsets. For ImageNet, the ImageNet-1K classes (ILSVRC12 challenge) are used as Known, and specific classes from ImageNet-21K-P are selected as Unknown.

NOTE: For the ImageNet splits, you will need around 0.5TB of free disk space.

NOTE: In some cases, a 'Medium' Unknown split is also specified. During benchmarking, these classes are combined into the 'Hard' split.

Config file

A config file is expected at ~/.ssb/ssb_config.json. It specifies where each dataset is saved to and read from. If a dataset is already present in the correct format, you can point to it in this config.

There should be one entry for each of: CUB, Stanford Cars, FGVC-Aircraft, ImageNet-1K, ImageNet-21K.

By default, it is:

{
    "cub_directory": "~/data/CUB",
    "aircraft_directory": "~/data/FGVC_Aircraft",
    "scars_directory": "~/data/Stanford_Cars",
    "imagenet_1k_directory": "~/data/ImageNet-1K",
    "imagenet_21k_directory": "~/data/ImageNet-21K"
}
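If you want to generate the config programmatically, the following is a minimal sketch. The helper name write_ssb_config is hypothetical and not part of SSB; it only writes the default keys listed above, optionally overridden with your own paths.

```python
import json
from pathlib import Path

# Default entries as listed in this README.
DEFAULT_SSB_CONFIG = {
    "cub_directory": "~/data/CUB",
    "aircraft_directory": "~/data/FGVC_Aircraft",
    "scars_directory": "~/data/Stanford_Cars",
    "imagenet_1k_directory": "~/data/ImageNet-1K",
    "imagenet_21k_directory": "~/data/ImageNet-21K",
}


def write_ssb_config(config_path, overrides=None):
    """Write an SSB-style config file (hypothetical helper, not part of SSB)."""
    config = dict(DEFAULT_SSB_CONFIG)
    config.update(overrides or {})
    # Reject keys that are not among the five documented entries.
    unexpected = set(config) - set(DEFAULT_SSB_CONFIG)
    if unexpected:
        raise KeyError(f"Unexpected config keys: {sorted(unexpected)}")
    path = Path(config_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config
```

For example, write_ssb_config("~/.ssb/ssb_config.json", {"cub_directory": "/mnt/datasets/CUB"}) would keep the other four defaults and redirect only CUB (the /mnt path is purely illustrative).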

Download and pre-process

To download and pre-process all datasets, run the following (note that ImageNet-1K and ImageNet-21K must both be downloaded explicitly):

>> from SSB.download import download_datasets
>> download_datasets(['cub', 'aircraft', 'scars', 'imagenet_1k', 'imagenet_21k'])

Kaggle: For some datasets, you will need a Kaggle account and API key set up (see README). For ImageNet, you will also need to join the competition.

FGVC datasets: These datasets are relatively small and should download and pre-process in a few minutes. They require only a few GB of disk space.

ImageNet-1K: This dataset takes a few hours to download and pre-process. It will need 200GB of disk space.

ImageNet-21K: This dataset is downloaded in parallel for each synset. You can set the number of workers depending on your system bandwidth and specifications. You will need around 0.5TB of disk space.
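Before starting a long download, it may help to check free disk space against the figures above. This is a minimal sketch, not part of SSB: the helper names are hypothetical, the 5 GB value for each FGVC dataset is an assumed stand-in for "a few GB", and the ImageNet numbers are simply summed per dataset as a conservative estimate.

```python
import shutil

GB = 10**9  # decimal gigabytes

# ImageNet figures come from the notes above; the FGVC values are an
# assumed placeholder for "a few GB".
REQUIRED_GB = {
    "cub": 5,
    "aircraft": 5,
    "scars": 5,
    "imagenet_1k": 200,
    "imagenet_21k": 500,
}


def required_gb(datasets):
    """Sum the approximate disk space (in GB) needed for the given datasets."""
    return sum(REQUIRED_GB[name] for name in datasets)


def has_enough_space(directory, datasets, free_bytes=None):
    """Return True if `directory` has room for `datasets`.

    `directory` must exist. `free_bytes` can be passed explicitly to skip
    the filesystem query.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb(datasets) * GB
```

Calling has_enough_space on the parent of your ImageNet directory with ['imagenet_1k', 'imagenet_21k'] gives a quick yes/no before committing to a multi-hour download.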

Annotation Formats:

For each dataset (CUB, Stanford Cars, FGVC-Aircraft, ImageNet), there is an associated {dataset_name}_ssb_splits.json file in SSB/splits. When loaded, the dictionary has the keys known_classes, unknown_classes, and known_unknown_pairs:

known_classes is a list of all Known class indices
unknown_classes is itself a dictionary. It contains keys for 'Easy', 'Medium' (if applicable), and 'Hard', each mapping to a list of Unknown class indices.
known_unknown_pairs (if applicable): For some datasets, for each Unknown class, we specify the Known class which is semantically most similar.
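The structure described above can be read with a few lines of Python. This is a minimal sketch rather than the SSB loader: load_ssb_splits is a hypothetical helper, the order in which 'Medium' is appended to 'Hard' is an assumption, and known_unknown_pairs is returned untouched because its exact layout is not specified here.

```python
import json


def load_ssb_splits(path):
    """Load a {dataset_name}_ssb_splits.json file (hypothetical helper)."""
    with open(path) as f:
        splits = json.load(f)

    unknown = splits["unknown_classes"]
    # 'Medium' is optional; when present it is folded into 'Hard',
    # mirroring what the benchmark does at evaluation time.
    hard = list(unknown.get("Hard", [])) + list(unknown.get("Medium", []))

    return {
        "known": list(splits["known_classes"]),
        "easy": list(unknown.get("Easy", [])),
        "hard": hard,
        # Only present for some datasets; None otherwise.
        "pairs": splits.get("known_unknown_pairs"),
    }
```

The returned dictionary keeps 'easy' and 'hard' as separate lists, so the two difficulty levels can be evaluated independently.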

In SSB/splits, we further include:

index_to_class_name.json. This maps each class index onto a readable class name.
imagenet_21k_val_files.json. For each synset in the Unknown ImageNet splits, this specifies the files which are reserved for validation.

A note on class index formats:

CUB: Class indices are given as 0 - 199. These correspond to 1 - 200 in the original dataset.
Stanford Cars: Classes are ordered according to cars_meta.mat in the original dataset. Note that this ordering differs from the one obtained by running a standard PyTorch ImageFolder constructor in the root directory.
FGVC-Aircraft: Class indices are given as in the original dataset.
ImageNet: Class indices are specified as WordNet synsets.
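As a small illustration of the CUB offset described above, the conversion between SSB indices (0 - 199) and original class IDs (1 - 200) is a shift by one. The function names below are hypothetical and not part of SSB.

```python
NUM_CUB_CLASSES = 200


def cub_ssb_to_original(ssb_index):
    """Map an SSB CUB index (0-199) to the original class ID (1-200)."""
    if not 0 <= ssb_index < NUM_CUB_CLASSES:
        raise ValueError(f"SSB CUB index out of range: {ssb_index}")
    return ssb_index + 1


def cub_original_to_ssb(original_id):
    """Map an original CUB class ID (1-200) to the SSB index (0-199)."""
    if not 1 <= original_id <= NUM_CUB_CLASSES:
        raise ValueError(f"Original CUB class ID out of range: {original_id}")
    return original_id - 1
```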

Dataset directory formats

Utilities are included to download all datasets in the correct format automatically. Alternatively, if you already have a dataset in the correct format, you can point to it in the config.

The formats usually follow the default style for the respective datasets and are detailed below. The exception is ImageNet, which is reformatted from the Kaggle version to be compatible with PyTorch ImageFolder.