Codebase for "We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline" (TMLR 2024).
This repo is built on top of mmsegmentation and the MIC repo. The setup below is a modification of the MIC installation instructions.
```
conda create -n mic python=3.8.5
conda activate mic
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
```

Next, install our mmcv fork from source (see the mmcv documentation for full build instructions if necessary). Running

```
git submodule update --recursive
```

will pull the mmcv submodule. Then simply run

```
MMCV_WITH_OPS=1 pip install -e . -v
```

inside the `submodules/mmcv` directory, followed by

```
pip install -e .
```

inside the mmseg root directory.
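To confirm that both editable installs are picked up inside the conda environment, a quick sanity check (the exact version numbers will depend on your checkout):

```python
# Run inside the `mic` environment after both `pip install -e .` steps.
import mmcv
import mmseg

print("mmcv:", mmcv.__version__)
print("mmseg:", mmseg.__version__)
```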
Please download the following datasets, which are used in the Video-DAS experiments, into `mmseg/datasets` with the following structure (key folders shown):
```
datasets/
├── cityscapes-seq
│   ├── gtFine                   % gtFine_trainvaltest.zip
│   └── leftImg8bit_sequence     % leftImg8bit_sequence_trainvaltest.zip
├── VIPER
│   ├── train
│   │   ├── img                  % Images: Frames: *0, *1, *[2-9]; Sequences: 01-77; Format: jpg
│   │   └── cls                  % Semantic class labels: Frames: *0, *1, *[2-9]; Sequences: 01-77; Format: png
│   ├── val
│   └── test
└── SynthiaSeq
    └── SYNTHIA-SEQS-04-DAWN
        ├── RGB
        └── GT
```
Download Links:
After downloading all datasets, we must generate sample class statistics on the source datasets (Viper, Synthia-Seq) and convert their class labels into Cityscapes-Seq classes. For Viper and Synthia-Seq respectively, run:

```
python tools/convert_datasets/viper.py datasets/viper --gt-dir train/cls/
python tools/convert_datasets/synthiaSeq.py datasets/SynthiaSeq/SYNTHIA-SEQS-04-DAWN --gt-dir GT/LABELS/Stereo_Left/Omni_F
```
We introduce support for a new target-domain dataset derived from BDD10k. BDD10k contains 10,000 driving images captured across a variety of conditions. Of these 10,000 images, we identify 3,429 with valid corresponding video clips in the BDD100k dataset, making this subset suitable for Video-DAS; we refer to it as BDDVid. We then split these 3,429 images into 2,999 train samples and 430 evaluation samples. In BDD10k, the labeled frame is generally at the 10th second of the 40-second clip, but not always. To mitigate this, we only evaluate on BDD10k images that perfectly correspond to their segmentation annotation, while at training time we use frames extracted directly from the BDD100k video clips.
The instructions below detail how to set up the BDDVid dataset.

1. **Download the segmentation labels for the BDD10k images** (https://bdd-data.berkeley.edu/portal.html#download).

2. **Download all BDD100k video parts:**

   ```
   cd datasets/BDDVid/setup/download
   python download.py --file_name bdd100k_video_links.txt
   ```

   Note: Make sure to specify in `download.py` the correct output directory where the video zips should be stored.
3. **Unzip all video files:**

   ```
   cd ../unzip
   python unzip.py
   ```

   Note: Make sure to specify in `unzip.py` the directory where the video zips are stored and the output directory where the files should be unzipped.
4. **Unpack each video sequence and extract the corresponding frame:**

   ```
   cd ../unpack_video
   ```

   Create a text file with the paths to each unzipped video; refer to `video_path_train.txt` and `video_path_val.txt` as examples. Then run:

   ```
   python unpack_video.py
   ```

   Note: You will run the script twice, once per split (train or val). Edit the `split` variable to specify train or val, and the `file_path` variable, which points to the list of video paths for the given split. Also note that, through experimentation and analysis, we determined that frame 307 of each video is the closest to the corresponding image in the BDD10k dataset (a minimal sketch of this frame grab is shown after this section). The possible slight label mismatch this introduces is dealt with in step 7.
5. **Download BDD10k ("10k Images") and its labels ("Segmentation" tab), and unzip them.**

6. **Copy the segmentation labels for train and val into BDDVid:**

   ```
   cd ../bdd10k_img_labels
   python grab_labels.py
   ```

   Note: Run this twice, once per split (train, val). Edit `orig_dataset` with the path to the corresponding split of the original BDD10k dataset downloaded in step 5.
7. **Fix the image-label mismatch:**

   We create two new folders to deal with the image-label mismatch at frame 307 described in step 4:

   1. `train_orig_10k` - same as train, but frame 307 is taken from the original BDD10k dataset. Use this directory for supervised BDD jobs.
   2. `val_orig_10k` - same as val, but frame 307 is taken from the original BDD10k dataset. ALWAYS use this split, as we want to compute validation over the actual image and label.

   ```
   python get_orig_images.py
   ```

   Note: Run this twice, once per split (train, val). Edit `orig_dataset` with the path to the corresponding split of the original BDD10k dataset downloaded in step 5.
BDDVid is now set up! For UDA jobs, use the `train` and `val_orig_10k` splits. For supervised jobs with BDDVid, use `train_orig_10k` and `val_orig_10k`.
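For reference, the core of step 4 above boils down to grabbing frame 307 from each 40-second clip. The OpenCV sketch below only illustrates that idea; the paths are placeholders and `unpack_video.py` remains the authoritative script:

```python
import cv2  # assumes OpenCV is installed; unpack_video.py may use a different decoder

TARGET_FRAME = 307  # frame closest to the labeled BDD10k image (see step 4)

def extract_frame(video_path, out_path, frame_idx=TARGET_FRAME):
    """Grab a single frame from a BDD100k clip and save it as an image."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)  # seek to the target frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read frame {frame_idx} from {video_path}")
    cv2.imwrite(out_path, frame)

# Example (hypothetical paths):
# extract_frame("bdd100k/videos/train/abc123.mov", "BDDVid/train/abc123/frame_0307.jpg")
```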
A number of our methods rely on optical flow between successive frames, so for each dataset we generated flows using FlowFormer. We host all of the generated flows for each dataset on Hugging Face.
Simply run:

```
git lfs install
git clone https://huggingface.co/datasets/hoffman-lab/Unified-VideoDA-Generated-Flows
```
This will produce the following file tree:

```
Unified-VideoDA-Generated-Flows/
├── SynthiaSeq_Flows
│   └── frame_dist_1
│       └── im
│           ├── synthiaSeq_im_backward_flow.tar.gz
│           └── synthiaSeq_im_forward_flow.tar.gz
├── BDDVid_Flows
│   └── frame_dist_2
│       ├── imtk
│       │   └── bddvid_imtk_backward_flow.tar.gz
│       └── im
│           └── bddvid_im_backward_flow.tar.gz
├── Viper_Flows
│   └── frame_dist_1
│       ├── imtk
│       │   └── viper_imtk_backward_flow.tar.gz
│       └── im
│           ├── viper_im_forward_flow.tar.gz
│           └── viper_im_backward_flow.tar.gz
└── CityscapesSeq_Flows
    └── frame_dist_1
        ├── imtk
        │   ├── csSeq_imtk_forward_flow.tar.gz
        │   └── csSeq_imtk_backward_flow.tar.gz
        └── im
            ├── csSeq_im_backward_flow.tar.gz
            └── csSeq_im_forward_flow.tar.gz
```
Finally, unpack each tar file. For instance:

```
cd Unified-VideoDA-Generated-Flows/SynthiaSeq_Flows/frame_dist_1/im
tar -xvzf synthiaSeq_im_backward_flow.tar.gz
```
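If you want to inspect the unpacked flows directly, the sketch below assumes they are stored as standard Middlebury `.flo` files; check the actual extension after unpacking, since `tools/aggregate_flows/flow/util_flow.py` contains the repo's own flow readers and is the reference to follow.

```python
import numpy as np

def read_flo(path):
    """Read a Middlebury .flo optical flow file into an (H, W, 2) float32 array."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        if magic != 202021.25:  # sanity tag written by the .flo format
            raise ValueError(f"{path} does not look like a .flo file")
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        data = np.fromfile(f, np.float32, count=2 * h * w)
    return data.reshape(h, w, 2)  # channels are (dx, dy)
```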
See `./experiments.md` for commands to run any experiment in the paper. For example, the HRDA baseline can be run via:

```
python tools/train.py configs/mic/viperHR2bddHR_mic_hrda.py --launcher=slurm --l-warp-lambda=0.0 --l-mix-lambda=1.0 --seed 1 --deterministic --work-dir=./work_dirs/<dirname> --nowandb True
```
We have made a number of key contributions to this open-source mmsegmentation repo to support video domain adaptive segmentation (Video-DAS) experiments for future researchers to build on.
Firstly, we consolidated both ImageDA and VideoDA techniques into the mmsegmentation repository. This enables researchers to easily switch between models, backbones, segmentation heads, and architectures.
We add the key datasets for the VideoDA benchmark (ViperSeq -> CityscapesSeq, SynthiaSeq -> CityscapesSeq) to mmsegmentation, along with our own newly constructed shifts (ViperSeq -> BDDVid, SynthiaSeq -> BDDVid), and add the capability to load consecutive images along with their corresponding optical flow at a specified frame distance (see the sketch below). This enables researchers to quickly start work on VideoDA-related problems or benchmark current ImageDA approaches in this setting.
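As a rough illustration of the frame-distance mechanism (not the actual `seqUtils.py` code; the filename pattern and flow directory layout below are assumptions made for the example), a training sample for frame `t` at distance `k` could be assembled like this:

```python
import os

def consecutive_sample_paths(img_dir, flow_dir, seq, frame_idx, frame_dist=1):
    """Return (current frame, past frame, backward flow) paths for one sample.

    The Cityscapes-Seq-style name "{seq}_{frame:06d}_leftImg8bit.png" and the
    flow filename are illustrative placeholders, not the repo's real layout.
    """
    im = os.path.join(img_dir, f"{seq}_{frame_idx:06d}_leftImg8bit.png")
    imtk = os.path.join(img_dir, f"{seq}_{frame_idx - frame_dist:06d}_leftImg8bit.png")
    flow = os.path.join(flow_dir, f"{seq}_{frame_idx:06d}_backward_flow.flo")
    return im, imtk, flow
```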
In addition, we provide implementations of common VideoDA techniques, such as video discriminators, the ACCEL architecture with consistent mixup, and a variety of pseudo-label refinement strategies.
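To make the pseudo-label refinement idea concrete, here is a minimal NumPy sketch of one such strategy, flow-warped consistency filtering. It is not the `dacs.py` implementation, and the backward-flow convention and ignore index below are assumptions:

```python
import numpy as np

def warp_labels_with_flow(labels_tk, flow, ignore_index=255):
    """Warp pseudo-labels from frame t-k into frame t using backward flow.

    labels_tk: (H, W) integer pseudo-labels for the earlier frame t-k.
    flow:      (H, W, 2) backward flow; flow[y, x] = (dx, dy) points from a
               pixel in frame t to its source location in frame t-k.
               (This convention is an assumption; verify it against the
               generated flows before relying on it.)
    """
    h, w = labels_tk.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.full((h, w), ignore_index, dtype=labels_tk.dtype)
    warped[valid] = labels_tk[src_y[valid], src_x[valid]]
    return warped

def refine_pseudo_labels(pl_t, pl_tk, flow, ignore_index=255):
    """Keep a frame-t pseudo-label only where it agrees with the warped t-k label."""
    warped = warp_labels_with_flow(pl_tk, flow, ignore_index)
    return np.where(pl_t == warped, pl_t, ignore_index).astype(pl_t.dtype)
```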
All experiments reported in the paper are available in the repository, each with a corresponding bash script to aid reproducibility.
The following files are where key changes were made:
VideoDA Dataset Support
- `mmseg/datasets/viperSeq.py`
- `mmseg/datasets/cityscapesSeq.py`
- `mmseg/datasets/SynthiaSeq.py`
- `mmseg/datasets/bddSeq.py`

Consecutive Frame/Optical Flow Support
- `mmseg/datasets/seqUtils.py`
- `tools/aggregate_flows/flow/my_utils.py`
- `tools/aggregate_flows/flow/util_flow.py`

VideoDA Techniques
- Video Discriminator: `mmseg/models/uda/dacsAdvseg.py`
- PL Refinement: `mmseg/models/uda/dacs.py`
- ACCEL + Consistent Mixup: `mmseg/models/segmentors/accel_hrda_encoder_decoder.py`, `mmseg/models/utils/dacs_transforms.py`

Dataset and Model Configurations
- `configs/_base_/datasets/*`
- `configs/mic/*`

Experiment Scripts
- `tools/experiments/*`
```
@inproceedings{kareer2024NotUsingVideosCorrectly,
  title={We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline},
  author={Simar Kareer and Vivek Vijaykumar and Harsh Maheshwari and Prithvi Chattopadhyay and Judy Hoffman and Viraj Prabhu},
  booktitle={Transactions on Machine Learning Research (TMLR)},
  year={2024}
}
```