This repository provides download instructions and helper code for the MOTSynth dataset, as well as baseline implementations for object detection, segmentation and tracking.
Check out our ICCV 2021 paper, cited at the bottom of this README.
For installation instructions, see `docs/INSTALL.md`.
We adapt torchvision's detection reference code to train Mask R-CNN on MOTSynth. To train Mask R-CNN with a ResNet50-FPN backbone, you can run the following:
```bash
NUM_GPUS=3
PORT=1234
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS --use_env --master_port=$PORT tools/train_detector.py \
    --model maskrcnn_resnet50_fpn \
    --batch-size 5 --world-size $NUM_GPUS --trainable-backbone-layers 1 --backbone resnet50 --train-dataset train --epochs 10
```
If you use a different number of GPUs (`$NUM_GPUS`), please adapt your learning rate or modify your batch size so that the overall batch size stays at 15 (3 GPUs with 5 images per GPU).
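A common heuristic for that adjustment is to scale the learning rate linearly with the effective batch size. A minimal sketch (the base learning rate below is an illustrative assumption, not the script's actual default; check `tools/train_detector.py` for the real value):

```python
# Linear learning-rate scaling with effective batch size (Goyal et al., 2017).
# ASSUMPTION: BASE_LR is illustrative, not the script's actual default.
BASE_LR = 0.02
BASE_BATCH_SIZE = 15  # reference setup: 3 GPUs * 5 images per GPU

def scaled_lr(num_gpus: int, imgs_per_gpu: int) -> float:
    """Scale the base learning rate linearly with the new effective batch size."""
    return BASE_LR * (num_gpus * imgs_per_gpu) / BASE_BATCH_SIZE

print(scaled_lr(num_gpus=4, imgs_per_gpu=5))  # ~0.0267 for an effective batch of 20
```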
Our trained model can be downloaded here.
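If you want to run the detector standalone, the checkpoint can be loaded back into torchvision's Mask R-CNN. A minimal sketch, assuming a two-class (background + pedestrian) head and a placeholder file name, neither of which is confirmed by this README:

```python
import torch
import torchvision

# Rebuild the training architecture; num_classes=2 (background + pedestrian)
# is an assumption about how the detector was configured.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)

# "motsynth_maskrcnn.pth" is a placeholder for the downloaded file. Training
# checkpoints often wrap the weights, e.g. under a "model" key.
checkpoint = torch.load("motsynth_maskrcnn.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)
model.load_state_dict(state_dict)
model.eval()
```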
We use our Mask R-CNN model trained on MOTSynth as the detector for Tracktor, and track on MOT17. To produce results on the MOT17 train set, you can run the following:
```bash
python tools/test_tracktor.py
```
This model should yield the following results on MOT17 train:
| Sequence | IDF1 | IDP | IDR | Rcll | Prcn | GT | MT | PT | ML | FP | FN | IDs | FM | MOTA | MOTP | IDt | IDa | IDm |
|----------|------|-----|-----|------|------|----|----|----|----|----|----|-----|----|------|------|-----|-----|-----|
| MOT17-02 | 35.2% | 51.7% | 26.7% | 38.9% | 75.4% | 62 | 8 | 27 | 27 | 2361 | 11353 | 99 | 152 | 25.7% | 0.251 | 28 | 78 | 8 |
| MOT17-04 | 55.5% | 65.9% | 48.0% | 63.2% | 86.8% | 83 | 29 | 33 | 21 | 4569 | 17524 | 93 | 245 | 53.3% | 0.204 | 23 | 75 | 5 |
| MOT17-05 | 62.2% | 78.4% | 51.6% | 59.0% | 89.6% | 133 | 30 | 71 | 32 | 473 | 2834 | 41 | 90 | 51.6% | 0.242 | 29 | 27 | 16 |
| MOT17-09 | 47.4% | 51.9% | 43.6% | 67.0% | 79.8% | 26 | 10 | 15 | 1 | 903 | 1757 | 51 | 69 | 49.1% | 0.230 | 21 | 34 | 6 |
| MOT17-10 | 42.1% | 60.1% | 32.4% | 49.1% | 91.1% | 57 | 12 | 23 | 22 | 614 | 6534 | 146 | 326 | 43.2% | 0.240 | 13 | 129 | 4 |
| MOT17-11 | 57.7% | 70.4% | 48.9% | 63.0% | 90.7% | 75 | 23 | 22 | 30 | 607 | 3491 | 31 | 43 | 56.2% | 0.197 | 7 | 26 | 2 |
| MOT17-13 | 39.9% | 64.7% | 28.8% | 38.4% | 86.2% | 110 | 17 | 47 | 46 | 717 | 7168 | 88 | 151 | 31.5% | 0.253 | 42 | 67 | 23 |
| OVERALL | 49.7% | 63.7% | 40.8% | 54.9% | 85.7% | 546 | 129 | 238 | 179 | 10244 | 50661 | 549 | 1076 | 45.3% | 0.220 | 163 | 436 | 64 |
We provide a simple baseline for MOTS. We run Tracktor with our trained Mask R-CNN detector, and use Mask R-CNN's segmentation head to produce a segmentation mask for every output bounding box.
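To illustrate the idea, here is a sketch of how per-box masks can be obtained from a torchvision Mask R-CNN by running only its mask branch on a given set of boxes. It relies on torchvision's internal `transform`/`roi_heads` modules, and the pedestrian class index 1 is an assumption, so treat it as a sketch of the technique rather than the exact baseline code:

```python
import torch
from torchvision.models.detection.roi_heads import paste_masks_in_image
from torchvision.models.detection.transform import resize_boxes

@torch.no_grad()
def masks_for_boxes(model, image, boxes):
    """Run only the mask branch of a torchvision Mask R-CNN on given boxes.

    image: 3xHxW float tensor in [0, 1]; boxes: Nx4 (x1, y1, x2, y2) tensor
    in original image coordinates, e.g. Tracktor's output boxes.
    """
    model.eval()
    orig_size = image.shape[-2:]
    # Normalize and resize the image exactly as the detector does internally.
    images, _ = model.transform([image], None)
    features = model.backbone(images.tensors)
    # Map the boxes into the resized image and pool features inside them.
    boxes_t = resize_boxes(boxes, orig_size, images.image_sizes[0])
    mask_feats = model.roi_heads.mask_roi_pool(features, [boxes_t], images.image_sizes)
    mask_logits = model.roi_heads.mask_predictor(model.roi_heads.mask_head(mask_feats))
    # Keep the pedestrian class (index 1 -- an assumption) and paste each
    # low-resolution mask back into the original image resolution.
    probs = mask_logits.sigmoid()[:, 1:2]
    masks = paste_masks_in_image(probs, boxes, orig_size)
    return masks[:, 0] > 0.5  # one binary HxW mask per input box
```

Note that the MOTS format requires non-overlapping masks, so overlaps between per-track masks still have to be resolved in postprocessing.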
To evaluate this model on MOTS20, you can run the following:
```bash
python tools/test_tracktor.py mots.do_mots=True mots.mots20_only=True
```
This model should yield the following results on MOTS20 train:
| Sequence | HOTA | IDF1 | MOTA |
|----------|------|------|------|
| MOTS20-02 | 39.084 | 48.942 | 38.486 |
| MOTS20-05 | 44.25 | 58.247 | 53.607 |
| MOTS20-09 | 37.661 | 49.214 | 54.713 |
| MOTS20-11 | 52.683 | 62.015 | 64.446 |
| COMBINED | 44.612 | 48.691 | 53.276 |
We treat MOTSynth and MOT17 as ReID datasets by sampling one in every 60 frames and treating each pedestrian as a unique identity. We use the torchreid framework to train our models.
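As an illustration of this construction (the annotation tuple format and helper below are hypothetical, not the repository's actual data loading code):

```python
def build_reid_samples(annotations, frame_stride=60):
    """Subsample frames and assign one global identity per pedestrian.

    annotations: iterable of (seq_id, frame_id, ped_id, crop_path) tuples --
    a hypothetical format chosen for illustration.
    """
    samples, id_map = [], {}
    for seq_id, frame_id, ped_id, crop_path in annotations:
        if frame_id % frame_stride != 0:
            continue  # keep 1 frame in every 60
        key = (seq_id, ped_id)  # pedestrian IDs are unique within a sequence
        if key not in id_map:
            id_map[key] = len(id_map)  # next unused identity label
        samples.append((crop_path, id_map[key]))
    return samples, len(id_map)  # (image, identity) pairs and #identities
```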
You can train our baseline ReID model with a ResNet50 backbone on MOTSynth (and evaluate it on MOT17 train) by running:
```bash
python tools/main_reid.py --config-file configs/r50_fc512_motsynth_train.yaml
```
The resulting checkpoint can be downloaded here.
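The checkpoint can then be used for feature extraction with torchreid. A minimal sketch (the file name is a placeholder, and `num_classes` only shapes the classifier head, which is irrelevant when extracting embeddings):

```python
import torchreid
from torchreid.utils import load_pretrained_weights

# Build the architecture matching configs/r50_fc512_motsynth_train.yaml.
model = torchreid.models.build_model(
    name="resnet50_fc512", num_classes=1000, pretrained=False
)
# "reid_checkpoint.pth.tar" is a placeholder for the downloaded file.
load_pretrained_weights(model, "reid_checkpoint.pth.tar")
model.eval()  # in eval mode, torchreid models return feature embeddings
```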
This codebase is built on top of several great works: our detection code is minimally modified from torchvision's detection reference code, our MOT code directly uses Tracktor's codebase, and our ReID code uses the torchreid framework. Orçun Cetintas also helped with the MOTS postprocessing code. We thank all the authors of these codebases for their amazing work.
If you find MOTSynth useful in your research, please cite our publication:
```bibtex
@inproceedings{fabbri21iccv,
  title     = {MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?},
  author    = {Matteo Fabbri and Guillem Bras{\'o} and Gianluca Maugeri and Aljo{\v{s}}a O{\v{s}}ep and Riccardo Gasparini and Orcun Cetintas and Simone Calderara and Laura Leal-Taix{\'e} and Rita Cucchiara},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2021}
}
```