[Feature] Gesture recognition algorithm MTUT on NVGesture dataset (#1380)

* add nvgesture dataset
* fix nvgesture pipelines
* update gesture datasets
* add ModelSetEpochHook
* nvgesture dataset support multi-GPU evalutation
* add i3d+mtut model
* add nvgesture i3d configs
* webcam add hand detector
* gesture recognition with bbox
* add hand detector config
* fix gesture recognizer init bug
* webcam/gesture - recognizer runs successfully
* delete unnecessary comment
* fix lint error in gesture configs
* add nvgesture category info
* webcam/gesture - display gesture recognition result
* add gesture recognition related docs
* update gesture related comments
* update light hand det model in demo doc
* update gesture recognition configs and results
* auto modify model-index.yml
* stabilize ssa loss in mtut
* add multi-input node comment
* synchronize tools/webcam with master
* add gesture task-name mapping
* move gesture configs to configs/hand/
* fix a bug in demo (#1373)
* update gesture datasets
* add ModelSetEpochHook
* nvgesture dataset support multi-GPU evalutation
* add i3d+mtut model
* add nvgesture i3d configs
* webcam add hand detector
* gesture recognition with bbox
* add hand detector config
* fix gesture recognizer init bug
* webcam/gesture - recognizer runs successfully
* delete unnecessary comment
* fix lint error in gesture configs
* add nvgesture category info
* webcam/gesture - display gesture recognition result
* add gesture recognition related docs
* update gesture related comments
* update light hand det model in demo doc
* update gesture recognition configs and results
* auto modify model-index.yml
* stabilize ssa loss in mtut
* add multi-input node comment
* synchronize tools/webcam with master
* add gesture task-name mapping
* move gesture configs to configs/hand/
* solve conflict in mmdet_modelzoo.md
* add gesture recogition into webcam
* hand gesture inference config explanation
* add gesture recognizer node in __init__.py
* add gesture webcam readme
* Adjust inference tracking min keypoints (#1398)
* Adjust inference tracking min keypoints
* Special case for min_keypoints <= 0 doesn't seem to be required
* remove unnecessary transformer utils (#1405)
* add nvgesture dataset
* fix nvgesture pipelines
* update gesture datasets
* add ModelSetEpochHook
* nvgesture dataset support multi-GPU evalutation
* add i3d+mtut model
* add nvgesture i3d configs
* webcam add hand detector
* gesture recognition with bbox
* add hand detector config
* fix gesture recognizer init bug
* webcam/gesture - recognizer runs successfully
* delete unnecessary comment
* fix lint error in gesture configs
* add nvgesture category info
* webcam/gesture - display gesture recognition result
* add gesture recognition related docs
* update gesture related comments
* update light hand det model in demo doc
* update gesture recognition configs and results
* auto modify model-index.yml
* stabilize ssa loss in mtut
* add multi-input node comment
* synchronize tools/webcam with master
* add gesture task-name mapping
* move gesture configs to configs/hand/
* fix grammer errors in docs
* fix a lint error in doc
* update nvgesture evaluation
* add introduction and assertion to TemporalPooling
* generalize NVGestureRandomFlip
* add unittests for gesture pipelines
* delete duplicated config
* add gesture inference unittest
* add gesture dataset unittest
* fix gesture inference unittest error
* add backbone I3D unittest
* add mtut head unittest
* fix mtut head unittest error
* add gesture recognizer unittest

Co-authored-by: Yining Li <liyining0712@gmail.com>
Co-authored-by: Philipp Allgeuer <5592992+pallgeuer@users.noreply.github.com>
1 parent 39f9dc8, commit d3c17d5
Showing 52 changed files with 3,692 additions and 31 deletions.
@@ -0,0 +1,42 @@
dataset_info = dict(
    dataset_name='nvgesture',
    paper_info=dict(
        author='Pavlo Molchanov and Xiaodong Yang and Shalini Gupta '
        'and Kihwan Kim and Stephen Tyree and Jan Kautz',
        title='Online Detection and Classification of Dynamic Hand Gestures '
        'with Recurrent 3D Convolutional Neural Networks',
        container='Proceedings of the IEEE Conference on '
        'Computer Vision and Pattern Recognition',
        year='2016',
        homepage='https://research.nvidia.com/publication/2016-06_online-'
        'detection-and-classification-dynamic-hand-gestures-recurrent-3d',
    ),
    category_info={
        0: 'five fingers move right',
        1: 'five fingers move left',
        2: 'five fingers move up',
        3: 'five fingers move down',
        4: 'two fingers move right',
        5: 'two fingers move left',
        6: 'two fingers move up',
        7: 'two fingers move down',
        8: 'click',
        9: 'beckoned',
        10: 'stretch hand',
        11: 'shake hand',
        12: 'one',
        13: 'two',
        14: 'three',
        15: 'lift up',
        16: 'press down',
        17: 'push',
        18: 'shrink',
        19: 'levorotation',
        20: 'dextrorotation',
        21: 'two fingers prod',
        22: 'grab',
        23: 'thumbs up',
        24: 'OK'
    },
    flip_pairs=[(0, 1), (4, 5), (19, 20)],
    fps=30)
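The `flip_pairs` entry pairs gesture classes whose labels must be swapped when a video is mirrored horizontally, e.g. 'five fingers move right' and 'five fingers move left'. The snippet below is a minimal sketch of how such a remapping could be applied during a random horizontal flip; the helper name and array layout are illustrative assumptions, not the actual `GestureRandomFlip` implementation added by this commit.

```python
import numpy as np

# Hypothetical helper: mirror the frames and swap direction-sensitive labels.
# `flip_pairs` follows the dataset_info above: [(0, 1), (4, 5), (19, 20)].
def flip_gesture_sample(video, label, flip_pairs):
    """video: (T, H, W, C) uint8 array; label: int class index."""
    flipped = video[:, :, ::-1, :]  # flip each frame left-right
    label_map = {a: b for a, b in flip_pairs}
    label_map.update({b: a for a, b in flip_pairs})
    new_label = label_map.get(label, label)  # unchanged if direction-agnostic
    return flipped, new_label

# Example: class 0 ('five fingers move right') becomes class 1 after flipping.
dummy = np.zeros((16, 240, 320, 3), dtype=np.uint8)
_, new_label = flip_gesture_sample(dummy, 0, [(0, 1), (4, 5), (19, 20)])
assert new_label == 1
```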
@@ -0,0 +1,7 @@
# Gesture Recognition

Gesture recognition aims to recognize hand gestures in a video, such as a thumbs-up.

## Data preparation

Please follow [DATA Preparation](/docs/en/tasks/2d_hand_gesture.md) to prepare the data.
@@ -0,0 +1,8 @@
# Multi-modal Training and Uni-modal Testing (MTUT) for gesture recognition

The MTUT method uses multi-modal data, such as RGB and depth videos, in the training phase.
For each modality, an I3D network is trained to perform gesture recognition. Spatial-temporal
semantic alignment across the modalities is used as an additional supervision signal, which
improves the performance of each single-modality I3D network.

In the testing phase, uni-modal data, typically an RGB video, is used.
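The spatial-temporal semantic alignment (SSA) supervision can be pictured as a penalty on the discrepancy between the feature correlation structures of the two modality networks. The sketch below is a simplified illustration of such an alignment loss over generic feature tensors; the exact formulation, the `beta` / `lambda_` weighting, and the focal-style regularization used by `MultiModalSSAHead` in this commit may differ.

```python
import torch
import torch.nn.functional as F

def ssa_alignment_loss(feat_rgb: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
    """Toy spatial-temporal semantic alignment loss.

    feat_rgb, feat_depth: (N, C, T, H, W) features from the two I3D branches.
    The per-sample channel correlation matrices of both branches are pushed to
    match, so each uni-modal network learns a similar semantic structure.
    """
    def corr(feat):
        n, c = feat.shape[:2]
        flat = feat.reshape(n, c, -1)                 # (N, C, T*H*W)
        flat = F.normalize(flat, dim=-1)
        return torch.bmm(flat, flat.transpose(1, 2))  # (N, C, C) correlation

    return F.mse_loss(corr(feat_rgb), corr(feat_depth))

# Usage sketch: add the alignment term to the per-modality classification losses.
rgb = torch.randn(2, 64, 8, 7, 7)
depth = torch.randn(2, 64, 8, 7, 7)
loss = ssa_alignment_loss(rgb, depth)
```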
configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture.md (60 additions, 0 deletions)
@@ -0,0 +1,60 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html">MTUT (CVPR'2019)</a></summary>

```bibtex
@InProceedings{Abavisani_2019_CVPR,
  author = {Abavisani, Mahdi and Joze, Hamid Reza Vaezi and Patel, Vishal M.},
  title = {Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
```

</details>

<!-- [BACKBONE] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2017/html/Carreira_Quo_Vadis_Action_CVPR_2017_paper.html">I3D (CVPR'2017)</a></summary>

```bibtex
@InProceedings{Carreira_2017_CVPR,
  author = {Carreira, Joao and Zisserman, Andrew},
  title = {Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {July},
  year = {2017}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2016/html/Molchanov_Online_Detection_and_CVPR_2016_paper.html">NVGesture (CVPR'2016)</a></summary>

```bibtex
@InProceedings{Molchanov_2016_CVPR,
  author = {Molchanov, Pavlo and Yang, Xiaodong and Gupta, Shalini and Kim, Kihwan and Tyree, Stephen and Kautz, Jan},
  title = {Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2016}
}
```

</details>
Results on the NVGesture test set

| Arch | Input Size | fps | bbox | AP_rgb | AP_depth | ckpt | log |
| :--- | :--------: | :-: | :--: | :----: | :------: | :--: | :-: |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py)$^\*$ | 112x112 | 15 | $\surd$ | 0.725 | 0.730 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py) | 224x224 | 30 | $\surd$ | 0.782 | 0.811 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py) | 224x224 | 30 | $\times$ | 0.739 | 0.809 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-20220530.log.json) |

$^\*$: MTUT supports multi-modal training and uni-modal testing. A model trained with this config can be used to recognize gestures in RGB videos with the [inference config](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15_rgb.py).
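Structurally, multi-modal training with uni-modal testing just means the recognizer keeps one backbone per modality and runs whichever modalities are present at test time. The toy sketch below illustrates that idea; it is a simplified stand-in, not the actual `GestureRecognizer` implementation from this commit.

```python
import torch
import torch.nn as nn

class TwoStreamRecognizer(nn.Module):
    """Toy multi-modal recognizer: one backbone per modality, shared classifier."""

    def __init__(self, backbones: dict, num_classes: int = 25, feat_dim: int = 64):
        super().__init__()
        self.backbones = nn.ModuleDict(backbones)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, inputs: dict) -> dict:
        # Train with {'rgb': ..., 'depth': ...}; test with only {'rgb': ...}.
        return {m: self.classifier(self.backbones[m](x)) for m, x in inputs.items()}

# Minimal usage: uni-modal testing after multi-modal training.
backbones = {
    'rgb': nn.Sequential(nn.Flatten(), nn.LazyLinear(64)),
    'depth': nn.Sequential(nn.Flatten(), nn.LazyLinear(64)),
}
model = TwoStreamRecognizer(backbones)
logits = model({'rgb': torch.randn(2, 3, 8, 32, 32)})  # only the RGB branch runs
print(logits['rgb'].shape)  # torch.Size([2, 25])
```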
configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture.yml (49 additions, 0 deletions)
@@ -0,0 +1,49 @@
Collections:
- Name: MTUT
  Paper:
    Title: Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training
    URL: https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html
  README: https://github.com/open-mmlab/mmpose/blob/master/docs/en/papers/algorithms/mtut.md
Models:
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py
  In Collection: MTUT
  Metadata:
    Architecture: &id001
    - MTUT
    - I3D
    Training Data: NVGesture
  Name: mtut_i3d_nvgesture_bbox_112x112_fps15
  Results:
  - Dataset: NVGesture
    Metrics:
      AP depth: 0.73
      AP rgb: 0.725
    Task: Hand Gesture
    Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py
  In Collection: MTUT
  Metadata:
    Architecture: *id001
    Training Data: NVGesture
  Name: mtut_i3d_nvgesture_bbox_224x224_fps30
  Results:
  - Dataset: NVGesture
    Metrics:
      AP depth: 0.811
      AP rgb: 0.782
    Task: Hand Gesture
    Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pth
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py
  In Collection: MTUT
  Metadata:
    Architecture: *id001
    Training Data: NVGesture
  Name: mtut_i3d_nvgesture_224x224_fps30
  Results:
  - Dataset: NVGesture
    Metrics:
      AP depth: 0.809
      AP rgb: 0.739
    Task: Hand Gesture
    Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth
configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py (128 additions, 0 deletions)
@@ -0,0 +1,128 @@
_base_ = [
    '../../../../_base_/default_runtime.py',
    '../../../../_base_/datasets/nvgesture.py'
]

checkpoint_config = dict(interval=5)
evaluation = dict(interval=5, metric='AP', save_best='AP_rgb')

optimizer = dict(
    type='SGD',
    lr=1e-2,
    momentum=0.9,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', gamma=0.1, step=[30, 50])
total_epochs = 75
log_config = dict(interval=10)

custom_hooks_config = [dict(type='ModelSetEpochHook')]

model = dict(
    type='GestureRecognizer',
    modality=['rgb', 'depth'],
    pretrained=dict(
        rgb='https://github.com/hassony2/kinetics_i3d_pytorch/'
        'raw/master/model/model_rgb.pth',
        depth='https://github.com/hassony2/kinetics_i3d_pytorch/'
        'raw/master/model/model_rgb.pth',
    ),
    backbone=dict(
        rgb=dict(
            type='I3D',
            in_channels=3,
            expansion=1,
        ),
        depth=dict(
            type='I3D',
            in_channels=1,
            expansion=1,
        ),
    ),
    cls_head=dict(
        type='MultiModalSSAHead',
        num_classes=25,
    ),
    train_cfg=dict(
        beta=2,
        lambda_=5e-3,
        ssa_start_epoch=61,
    ),
    test_cfg=dict(),
)

data_cfg = dict(
    video_size=[320, 240],
    modality=['rgb', 'depth'],
)

train_pipeline = [
    dict(type='LoadVideoFromFile'),
    dict(type='ModalWiseChannelProcess'),
    dict(type='CropValidClip'),
    dict(type='TemporalPooling', length=64, ref_fps=30),
    dict(type='ResizeGivenShortEdge', length=256),
    dict(type='RandomAlignedSpatialCrop', length=224),
    dict(type='GestureRandomFlip'),
    dict(type='MultiModalVideoToTensor'),
    dict(
        type='VideoNormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='Collect', keys=['video', 'label'], meta_keys=['fps', 'modality']),
]

val_pipeline = [
    dict(type='LoadVideoFromFile'),
    dict(type='ModalWiseChannelProcess'),
    dict(type='CropValidClip'),
    dict(type='TemporalPooling', length=-1, ref_fps=30),
    dict(type='ResizeGivenShortEdge', length=256),
    dict(type='CenterSpatialCrop', length=224),
    dict(type='MultiModalVideoToTensor'),
    dict(
        type='VideoNormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='Collect', keys=['video', 'label'], meta_keys=['fps', 'modality']),
]

test_pipeline = val_pipeline

data_root = 'data/nvgesture'
data = dict(
    samples_per_gpu=6,
    workers_per_gpu=2,
    val_dataloader=dict(samples_per_gpu=6),
    test_dataloader=dict(samples_per_gpu=6),
    train=dict(
        type='NVGestureDataset',
        ann_file=f'{data_root}/annotations/'
        'nvgesture_train_correct_cvpr2016_v2.lst',
        vid_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=train_pipeline,
        dataset_info={{_base_.dataset_info}}),
    val=dict(
        type='NVGestureDataset',
        ann_file=f'{data_root}/annotations/'
        'nvgesture_test_correct_cvpr2016_v2.lst',
        vid_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        test_mode=True,
        dataset_info={{_base_.dataset_info}}),
    test=dict(
        type='NVGestureDataset',
        ann_file=f'{data_root}/annotations/'
        'nvgesture_test_correct_cvpr2016_v2.lst',
        vid_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=test_pipeline,
        test_mode=True,
        dataset_info={{_base_.dataset_info}}))
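The `{{_base_.dataset_info}}` entries are mmcv config placeholders that are replaced by the `dataset_info` dict from the base file when the config is parsed. A quick way to check that the `_base_` inheritance resolves as expected is to load the config with `mmcv.Config` from the repository root; training itself is normally launched through MMPose's standard `tools/train.py` / `tools/dist_train.sh` entry points. The snippet below is a minimal sketch and assumes an MMPose 0.x checkout containing this config.

```python
from mmcv import Config

# Load the training config; _base_ files are merged and
# {{_base_.dataset_info}} is substituted with the dict from the base dataset file.
cfg = Config.fromfile(
    'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/'
    'i3d_nvgesture_224x224_fps30.py')

print(cfg.model.type)        # GestureRecognizer
print(cfg.model.modality)    # ['rgb', 'depth']
print(cfg.data.train.type)   # NVGestureDataset
print(cfg.total_epochs)      # 75
print(cfg.data.train.dataset_info.dataset_name)  # nvgesture
```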