# Set-up

## Dependencies

This codebase has been tested with the packages and versions specified in `requirements.txt` and Python 3.8.

We recommend creating a new conda virtual environment:

```bash
conda create -n multimae python=3.8 -y
conda activate multimae
```

Then, install PyTorch 1.10.0+ and torchvision 0.11.1+. For example:

```bash
conda install pytorch=1.10.0 torchvision=0.11.1 -c pytorch -y
```
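To confirm that the install picked up the intended versions and a CUDA-enabled build, a quick optional check:

```python
# Optional sanity check of the installed versions and CUDA availability.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```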

Finally, install all other required packages:

```bash
pip install timm==0.4.12 einops==0.3.2 pandas==1.3.4 albumentations==1.1.0 wandb==0.12.11
```

ℹ️ If data loading and image transforms are the bottleneck, consider replacing Pillow with Pillow-SIMD and compiling it with libjpeg-turbo. You can find a detailed guide on how to do this here or use the provided script:

```bash
sh tools/install_pillow_simd.sh
```
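After running the script, you can check which Pillow build is active. This is a minimal sketch; the assumption that Pillow-SIMD releases carry a `.postX` suffix in their version string reflects their usual versioning convention, not something guaranteed by this codebase:

```python
# Print the active Pillow version; Pillow-SIMD builds typically carry a
# ".postX" suffix (assumption about their versioning convention).
import PIL

print("Pillow version:", PIL.__version__)
print("Pillow-SIMD build:", "post" in PIL.__version__)
```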

## Dataset Preparation

### Dataset structure

For simplicity and uniformity, all our datasets are structured in the following way:

```
/path/to/data/
├── train/
│   ├── modality1/
│   │   └── subfolder1/
│   │       ├── img1.ext1
│   │       └── img2.ext1
│   └── modality2/
│       └── subfolder1/
│           ├── img1.ext2
│           └── img2.ext2
└── val/
    ├── modality1/
    │   └── subfolder2/
    │       ├── img3.ext1
    │       └── img4.ext1
    └── modality2/
        └── subfolder2/
            ├── img3.ext2
            └── img4.ext2
```

The folder structure and filenames should match across modalities. If a dataset does not have specific subfolders, a generic subfolder name can be used instead (e.g., all/).
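Because modalities are paired by their relative paths, it can help to sanity-check that every file in one modality has a counterpart in the others (up to the file extension). A minimal sketch assuming the layout above; the modality names and the data path are only examples:

```python
# Sketch: verify that filenames (ignoring extensions) match across modalities.
from pathlib import Path

def check_alignment(split_dir, modalities):
    """Return relative paths (without extension) missing from at least one modality."""
    stems_per_modality = []
    for modality in modalities:
        root = Path(split_dir) / modality
        stems = {p.relative_to(root).with_suffix("") for p in root.rglob("*") if p.is_file()}
        stems_per_modality.append(stems)
    common = set.intersection(*stems_per_modality)
    union = set.union(*stems_per_modality)
    return union - common

missing = check_alignment("/path/to/data/train", ["rgb", "depth", "semseg"])
print(f"{len(missing)} file(s) missing from at least one modality")
```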

For most experiments, we use RGB (`rgb`), depth (`depth`), and semantic segmentation (`semseg`) as our modalities.

RGB images are stored as either PNG or JPEG images. Depth maps are stored as either single-channel JPX or single-channel PNG images. Semantic segmentation maps are stored as single-channel PNG images.
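As an illustration of how these files can be read, here is a hedged sketch using PIL and NumPy; the file paths are placeholders and the exact decoding details (e.g., depth scaling) depend on the dataset, so treat this as an example rather than the codebase's loader:

```python
# Sketch: load one sample per modality (paths are illustrative placeholders).
import numpy as np
from PIL import Image

rgb = np.array(Image.open("train/rgb/all/img1.jpg").convert("RGB"))   # (H, W, 3) uint8
depth = np.array(Image.open("train/depth/all/img1.png"))              # (H, W) single channel
semseg = np.array(Image.open("train/semseg/all/img1.png"))            # (H, W) class indices

print(rgb.shape, depth.shape, semseg.shape)
```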

### Datasets

We use the following datasets in our experiments:

To download these datasets, please follow the instructions on their respective pages. To prepare the NYUv2 dataset, we recommend using the provided `prepare_nyuv2.py` script.

### Downloadable ImageNet-1K pseudo labels

We publish links to download the Omnidata depth and COCO semantic segmentation pseudo labels here. The images for each ImageNet class are stored as tar-files.

To download the dataset, we recommend using aria2c, which you can install using:

```bash
sudo apt-get update
sudo apt-get install aria2
```

Download both train and validation splits for the depth and semantic segmentation labels by calling:

```bash
aria2c --input-file ./tools/pseudolabel_links/all_aria2c.txt -d /the/download/directory -j 16 -x 16
```

For additional download options, please see the aria2c documentation.
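Once downloaded, the per-class tar files need to be unpacked into the dataset layout described above. A minimal sketch, assuming one tar file per ImageNet class whose filename matches the class folder; the directory arguments and naming are assumptions, so adjust them to the actual download:

```python
# Sketch: unpack per-class tar files into <out_dir>/<class_name>/.
import tarfile
from pathlib import Path

def extract_all(tar_dir, out_dir):
    for tar_path in sorted(Path(tar_dir).glob("*.tar")):
        target = Path(out_dir) / tar_path.stem        # e.g. one folder per ImageNet class
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(tar_path) as tf:
            tf.extractall(target)

extract_all("/the/download/directory/depth/train", "/path/to/data/train/depth")
```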

Please note that by downloading this dataset you consent to non-commercial use and agree to its license.

### Pseudo labeling networks

ℹ️ The MultiMAE pre-training strategy is flexible and can benefit from higher quality pseudo labels and ground truth data. So feel free to use different pseudo labeling networks and datasets than the ones we used!

We use two off-the-shelf networks to pseudo label the ImageNet-1K dataset.

- **Depth estimation:** We use a DPT with a ViT-B-Hybrid backbone pre-trained on the Omnidata dataset. You can find installation instructions and pre-trained weights for this model here.
- **Semantic segmentation:** We use a Mask2Former with a Swin-S backbone pre-trained on the COCO dataset. You can find installation instructions and pre-trained weights for this model here.

For an example of how to use these networks for pseudo labeling, please take a look at our Colab notebook.
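As a rough illustration of the pseudo-labeling pattern (independent of the specific DPT / Mask2Former checkpoints, whose setup is documented in the links above), the sketch below runs a frozen depth network over the RGB images and writes its predictions alongside them. The model loading, the input normalization, and the 16-bit output scaling are assumptions for illustration only; see the linked repositories and the Colab notebook for the actual pipelines:

```python
# Sketch of the pseudo-labeling pattern: run a frozen network over the RGB
# images and save its predictions, mirroring the dataset layout.
# The `model` argument is a placeholder for the model-specific setup.
from pathlib import Path

import numpy as np
import torch
from PIL import Image

@torch.no_grad()
def pseudo_label_depth(model, rgb_dir, out_dir, device="cuda"):
    """Predict depth for every RGB image and save it as a single-channel PNG."""
    model.eval().to(device)
    rgb_dir = Path(rgb_dir)
    for img_path in sorted(rgb_dir.rglob("*.jpg")):
        img = Image.open(img_path).convert("RGB")
        x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().div(255)
        pred = model(x.unsqueeze(0).to(device)).squeeze().cpu().numpy()
        # Store as 16-bit PNG under the same relative path as the RGB image.
        out_path = Path(out_dir) / img_path.relative_to(rgb_dir).with_suffix(".png")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        pred_16bit = (pred / (pred.max() + 1e-8) * 65535).astype(np.uint16)
        Image.fromarray(pred_16bit).save(out_path)
```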