[TPAMI] A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition
This repo is the official implementation of "A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition" as well as its follow-ups. It is an extension of the CVPR'23 paper and currently includes code and models for the following tasks:
RGB-D-based Action Recognition: Included in this repo.
RGB-D-based Gesture Recognition: Included in this repo.
Video data augmentation: Included in this repo. See the ShuffleMix+ strategy in this paper.
2024/2/28
- To reuse UMDR-Net on a new action recognition dataset, one may refer to this guide.
2023/10/06
- This method achieved 4th place in the ICIAP Multimodal Action Recognition Competition.
2023/07/29
- Uploaded the training results on the NTU-RGBD-120 dataset.
2023/06/20
- Added dataset split files.
- Fixed some bugs.
- Updated the README.
- Visualization of the class activation responses for RGB and depth modalities.
The proposed method (UMDR) outperforms a number of state-of-the-art methods on both action and gesture datasets.
This is a PyTorch implementation of our paper. Requirements: torch>=1.7.0; torchvision>=0.8.0; Visdom (optional).
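A quick sanity check of the environment (a convenience sketch, not part of the repository):

```python
# Convenience check that the environment meets the stated requirements.
import torch
import torchvision

print("torch:", torch.__version__)                    # expected >= 1.7.0
print("torchvision:", torchvision.__version__)        # expected >= 0.8.0
print("CUDA available:", torch.cuda.is_available())   # distributed training assumes CUDA GPUs
```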
Data preparation: organize the datasets with the following folder structure:
│NTURGBD/
├──dataset_splits/
│ ├── @CS
│ │ ├── train.txt
video name total frames label
│ │ │ ├──S001C001P001R001A001_rgb 103 0
│ │ │ ├──S001C001P001R001A004_rgb 99 3
│ │ │ ├──......
│ │ ├── valid.txt
│ ├── @CV
│ │ ├── train.txt
│ │ ├── valid.txt
├──ImagesResize/
│ │ ├── S001C002P001R001A002_rgb
│ │ │ ├──000000.jpg
│ │ │ ├──000001.jpg
│ │ │ ├──......
├──nturgb+d_depth_masked/
│ │ ├── S001C002P001R001A002
│ │ │ ├──MDepth-00000000.png
│ │ │ ├──MDepth-00000001.png
│ │ │ ├──......
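Each line of train.txt / valid.txt lists the video folder name, the total number of frames, and the class label (as in the @CS example above). A minimal parser sketch for this format (the helper name is ours, not code from the repository):

```python
# Minimal parser sketch for the split files: "<video_name> <num_frames> <label>" per line.
def load_split(txt_path):
    samples = []
    with open(txt_path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 3:
                continue  # skip empty or malformed lines
            name, num_frames, label = parts[0], int(parts[1]), int(parts[2])
            samples.append((name, num_frames, label))
    return samples

# Example: samples = load_split("NTURGBD/dataset_splits/@CS/train.txt")
```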
NOTE: We use the NTU dataset's high-resolution RGB video (1280x960). To avoid losing information, we do not resize the video frames directly to 320x240. Instead, we crop a 640x480 ROI from each frame using the released mask images, and then resize the cropped region to 320x240 for training and testing. See data/data_preprose_for_NTU.py for the data preprocessing code.
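A rough sketch of that crop-then-resize step (illustrative only; the released script data/data_preprose_for_NTU.py is authoritative, and this sketch assumes the mask has already been registered/resized to the RGB resolution):

```python
import cv2
import numpy as np

def crop_roi_and_resize(rgb_frame, mask, roi_size=(640, 480), out_size=(320, 240)):
    """Crop a 640x480 ROI around the subject (located via the mask) and resize to 320x240.

    rgb_frame: HxWx3 uint8 image (assumed at least 640x480); mask: HxW array, non-zero on the subject.
    Illustrative approximation only; the repository's preprocessing may place the ROI differently."""
    ys, xs = np.nonzero(mask)
    cx = int(xs.mean()) if len(xs) else rgb_frame.shape[1] // 2  # fall back to frame center
    cy = int(ys.mean()) if len(ys) else rgb_frame.shape[0] // 2
    w, h = roi_size
    x0 = np.clip(cx - w // 2, 0, rgb_frame.shape[1] - w)
    y0 = np.clip(cy - h // 2, 0, rgb_frame.shape[0] - h)
    roi = rgb_frame[y0:y0 + h, x0:x0 + w]
    return cv2.resize(roi, out_size)  # (width, height) = (320, 240)
```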
We propose to decouple and recouple spatiotemporal representations for RGB-D-based motion recognition. The figure in the first row illustrates the proposed multimodal spatiotemporal representation learning framework. The figure in the second row shows how decoupled and multi-stage recoupled spatiotemporal representations are learned from unimodal data. We pre-trained all of our models on the 20BN Jester V1 dataset, except for NTU-RGBD. Alternatively, one can use the parameters trained on NTU-RGBD to initialize the model before training on other datasets, such as IsoGD, NvGesture and THU-READ.
Take training an RGB model with 8 GPUs on the NTU-RGBD dataset as an example:
# type: M (RGB), K (depth); sample-duration: the length of the video clip; smprob: hyperparameter $\rho$; mixup: hyperparameter $\alpha_{m}$; shufflemix: hyperparameter $\alpha_{s}$; intar-fatcer: controls the temporal resolution of each sub-branch in DTN (default: 2 when sample-duration=16/32; 4 when sample-duration=64).
python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 --use_env train.py --config config/NTU.yml --data /path/to/Dataset/NTU-RGBD/frames --splits /path/to/Dataset/NTU-RGBD/dataset_splits/@CS/ --save ./output_dir/ --batch-size 16 --sample-duration 32 \
--opt sgd \
--lr 0.01 \
--sched cosine \
--smprob 0.2 --mixup 0.8 --shufflemix 0.3 --epochs 100 --distill 0.2 --type M --intar-fatcer 2
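For intuition, the --mixup coefficient $\alpha_{m}$ follows the standard mixup recipe. The sketch below is a generic illustration only, not the repository's exact ShuffleMix+ implementation (which, per the paper, additionally shuffles clips along the temporal dimension, governed by --smprob and --shufflemix):

```python
import torch

def mixup_clips(clips, labels, alpha_m=0.8):
    """Generic mixup between video clips in a batch (illustrative; see the ShuffleMix+ paper).

    clips: (B, C, T, H, W) tensor; labels: (B,) long tensor."""
    lam = torch.distributions.Beta(alpha_m, alpha_m).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1.0 - lam) * clips[perm]
    # Train with: loss = lam * CE(out, labels) + (1 - lam) * CE(out, labels[perm])
    return mixed, labels, labels[perm], lam
```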
Take training an RGB model with 8 GPUs on the IsoGD dataset as an example:
python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 --use_env train.py --config config/IsoGD.yml --data /path/to/Dataset/IsoGD/frames --splits ./data/dataset_splits/IsoGD/rgb/ --save ./output_dir/ --batch-size 16 --sample-duration 32 \
--opt sgd \
--lr 0.01 \
--sched cosine \
--smprob 0.2 --mixup 0.8 --shufflemix 0.3 --epochs 100 --distill 0.2 --type M --intar-fatcer 2 \
--finetune ./Checkpoints/NTU-RGBD-32-DTNV2-TSM/model_best.pth.tar
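The --finetune flag initializes the network from the NTU-pretrained checkpoint before training on IsoGD. Conceptually this amounts to a partial state-dict load along the lines of the sketch below (our own illustration, not the repository's loading code):

```python
import torch

def load_pretrained(model, ckpt_path):
    """Initialize from a pretrained checkpoint, skipping parameters whose shapes do not match
    (e.g. the classification head when the number of classes changes). Illustrative sketch."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    own = model.state_dict()
    filtered = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    model.load_state_dict(filtered, strict=False)
    return model
```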
Take training the cross-modal fusion network (RGB-D) with 8 GPUs on the NTU-RGBD dataset as an example:
# scc-depth: number of CFCer blocks used in the spatial domain; tcc-depth: number of CFCer blocks used in the temporal domain.
python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 --use_env train_fusion.py --config config/NTU.yml --data /path/to/Dataset/NTU-RGBD/frames --splits /path/to/Dataset/NTU-RGBD/dataset_splits/@CS/ --save ./output_dir/ --batch-size 16 --sample-duration 32 \
--smprob 0.2 --mixup 0.8 --shufflemix 0.3 --epochs 100 --distill 0.0 --intar-fatcer 2 \
--FusionNet cs32 --lr 0.01 --sched step --opt sgd --decay-epochs 10 --scc-depth 2 --tcc-depth 4 --type rgbd
Score fusion:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env score_fusion.py --config config/NTU.yml --data /path/to/Dataset/NTU-RGBD/frames --splits /path/to/Dataset/NTU-RGBD/dataset_splits/@CS/ --save ./output_dir/ --batch-size 16 --sample-duration 32 --intar-fatcer 2 --FusionNet cs32 --type rgbd
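In general, score-level fusion combines the per-class predictions of the individual streams. The snippet below illustrates that idea generically (score_fusion.py may weight or combine the streams differently, e.g. through the trained fusion network):

```python
import torch

def fuse_scores(rgb_logits, depth_logits, w_rgb=0.5):
    """Generic late fusion of per-class scores from the RGB and depth streams (illustration only)."""
    p_rgb = torch.softmax(rgb_logits, dim=1)
    p_depth = torch.softmax(depth_logits, dim=1)
    fused = w_rgb * p_rgb + (1.0 - w_rgb) * p_depth
    return fused.argmax(dim=1)  # fused class prediction per sample
```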
To evaluate a trained model:
python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 --use_env train.py --config config/NTU.yml --data /path/to/Dataset/NTU-RGBD/frames --splits /path/to/Dataset/NTU-RGBD/dataset_splits/@CS/ --batch-size 16 --sample-duration 32 --eval_only --resume /path/to/model_best.pth.tar
Dataset | Modality | #Frames | Accuracy (%) | Download |
---|---|---|---|---|
NTU-RGBD(CS) | RGB | 16/32/64 | 92.0/92.2/92.9 | Google Drive |
NTU-RGBD(CS) | Depth | 16/32/64 | 94.5/94.8/95.0 | Google Drive |
NTU-RGBD(CS) | RGB-D | 16/32/64 | 95.6/95.9/96.2 | Google Drive |
NTU-RGBD(CV) | RGB | 16/32/64 | 95.3/95.8/96.3 | Google Drive |
NTU-RGBD(CV) | Depth | 16/32/64 | 95.4/95.9/96.5 | Google Drive |
NTU-RGBD(CV) | RGB-D | 16/32/64 | 97.5/97.8/98.0 | Google Drive |
NTU-RGBD-120(CS) | RGB | 16/32/64 | -/89.8/- | Google Drive |
NTU-RGBD-120(CS) | Depth | 16/32/64 | -/92.6/- | Google Drive |
IsoGD | RGB | 16/32/64 | 60.6/63.7/64.4 | Google Drive |
IsoGD | Depth | 16/32/64 | 63.4/64.6/65.5 | Google Drive |
IsoGD | RGB-D | 16/32/64 | 69.2/72.6/72.7 | Google Drive |
@ARTICLE{zhou2023umdr,
author={Zhou, Benjia and Wang, Pichao and Wan, Jun and Liang, Yanyan and Wang, Fan},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={A Unified Multimodal De- and Re-Coupling Framework for RGB-D Motion Recognition},
year={2023},
pages={1-15},
doi={10.1109/TPAMI.2023.3274783}}
The code is released under the MIT license.
Copyright (C) 2010-2021 Alibaba Group Holding Limited.