[Feature] Gesture recognition algorithm MTUT on NVGesture dataset (#1380)

* add nvgesture dataset

* fix nvgesture pipelines

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* fix a bug in demo (#1373)

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* solve conflict in mmdet_modelzoo.md

* add gesture recognition into webcam

* hand gesture inference config explanation

* add gesture recognizer node in __init__.py

* add gesture webcam readme

* Adjust inference tracking min keypoints (#1398)

* Adjust inference tracking min keypoints

* Special case for min_keypoints <= 0 doesn't seem to be required

* remove unnecessary transformer utils (#1405)

* add nvgesture dataset

* fix nvgesture pipelines

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* fix grammar errors in docs

* fix a lint error in doc

* update nvgesture evaluation

* add introduction and assertion to TemporalPooling

* generalize NVGestureRandomFlip

* add unittests for gesture pipelines

* delete duplicated config

* add gesture inference unittest

* add gesture dataset unittest

* fix gesture inference unittest error

* add backbone I3D unittest

* add mtut head unittest

* fix mtut head unittest error

* add gesture recognizer unittest

Co-authored-by: Yining Li <liyining0712@gmail.com>
Co-authored-by: Philipp Allgeuer <5592992+pallgeuer@users.noreply.github.com>
3 people authored Jun 2, 2022
1 parent 39f9dc8 commit d3c17d5
Showing 52 changed files with 3,692 additions and 31 deletions.
1 change: 1 addition & 0 deletions .dev_scripts/github/update_model_index.py
@@ -151,6 +151,7 @@ def parse_config_path(path):
'3d_kpt_mview_rgb_img': '3D Keypoint',
'3d_kpt_sview_rgb_vid': '3D Keypoint',
'3d_mesh_sview_rgb_img': '3D Mesh',
'gesture_sview_rgbd_vid': 'Gesture',
None: None
}
task_readable = task2readable.get(task)
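
Below is a minimal, hypothetical sketch (not part of this diff) of how such a task-to-readable mapping is typically consumed when deriving model-index metadata from a config path; the helper name and path parsing are illustrative and may differ from the real `parse_config_path`:

```python
# Hypothetical illustration: map the task token in a config path to a
# readable task name. The actual parse_config_path() logic may differ.
task2readable = {
    '3d_kpt_mview_rgb_img': '3D Keypoint',
    '3d_kpt_sview_rgb_vid': '3D Keypoint',
    '3d_mesh_sview_rgb_img': '3D Mesh',
    'gesture_sview_rgbd_vid': 'Gesture',
    None: None,
}


def readable_task_from_path(path):
    # e.g. 'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/xxx.py'
    parts = path.split('/')
    task = parts[2] if len(parts) > 2 else None
    return task2readable.get(task)


assert readable_task_from_path(
    'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15.py') == 'Gesture'
```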
42 changes: 42 additions & 0 deletions configs/_base_/datasets/nvgesture.py
@@ -0,0 +1,42 @@
dataset_info = dict(
dataset_name='nvgesture',
paper_info=dict(
author='Pavlo Molchanov and Xiaodong Yang and Shalini Gupta '
'and Kihwan Kim and Stephen Tyree and Jan Kautz',
title='Online Detection and Classification of Dynamic Hand Gestures '
'with Recurrent 3D Convolutional Neural Networks',
container='Proceedings of the IEEE Conference on '
'Computer Vision and Pattern Recognition',
year='2016',
homepage='https://research.nvidia.com/publication/2016-06_online-'
'detection-and-classification-dynamic-hand-gestures-recurrent-3d',
),
category_info={
0: 'five fingers move right',
1: 'five fingers move left',
2: 'five fingers move up',
3: 'five fingers move down',
4: 'two fingers move right',
5: 'two fingers move left',
6: 'two fingers move up',
7: 'two fingers move down',
8: 'click',
9: 'beckoned',
10: 'stretch hand',
11: 'shake hand',
12: 'one',
13: 'two',
14: 'three',
15: 'lift up',
16: 'press down',
17: 'push',
18: 'shrink',
19: 'levorotation',
20: 'dextrorotation',
21: 'two fingers prod',
22: 'grab',
23: 'thumbs up',
24: 'OK'
},
flip_pairs=[(0, 1), (4, 5), (19, 20)],
fps=30)
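
For illustration only (not part of this diff): the `flip_pairs` above pair directional gestures whose labels should swap when a clip is mirrored horizontally. A minimal sketch of such a remapping, which may differ from MMPose's actual `GestureRandomFlip` transform:

```python
# Hypothetical label remapping on horizontal flip: directional gestures such as
# 'five fingers move right' (0) and 'five fingers move left' (1) swap labels,
# while symmetric gestures keep theirs.
flip_pairs = [(0, 1), (4, 5), (19, 20)]
flip_map = {}
for a, b in flip_pairs:
    flip_map[a] = b
    flip_map[b] = a


def flip_label(label):
    return flip_map.get(label, label)


assert flip_label(0) == 1   # 'five fingers move right' -> 'move left'
assert flip_label(8) == 8   # 'click' is unaffected by mirroring
```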
7 changes: 7 additions & 0 deletions configs/hand/gesture_sview_rgbd_vid/README.md
@@ -0,0 +1,7 @@
# Gesture Recognition

Gesture recognition aims to recognize hand gestures in videos, such as a thumbs-up.

## Data preparation

Please follow [DATA Preparation](/docs/en/tasks/2d_hand_gesture.md) to prepare data.
8 changes: 8 additions & 0 deletions configs/hand/gesture_sview_rgbd_vid/mtut/README.md
@@ -0,0 +1,8 @@
# Multi-modal Training and Uni-modal Testing (MTUT) for gesture recognition

The MTUT method uses multi-modal data, such as RGB and depth videos, in the training phase.
For each modality, an I3D network is trained to perform gesture recognition. The spatial-temporal
semantic alignment across modalities is used as an additional supervision signal to improve the
performance of each single-modality I3D network.

In the testing phase, uni-modal data, typically an RGB video, is used.
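
As a rough illustration of the alignment idea, the sketch below pushes channel-correlation matrices of the RGB and depth I3D features to agree. This is a simplified stand-in for the SSA loss, not the exact implementation in the paper or in `MultiModalSSAHead` (which additionally uses a focal regularization term to avoid negative transfer between streams):

```python
import torch
import torch.nn.functional as F


def ssa_like_loss(feat_rgb, feat_depth):
    """Toy spatial-temporal semantic alignment loss.

    feat_*: (N, C, T, H, W) features from the RGB and depth I3D streams.
    Channel-correlation matrices are pushed to agree across modalities;
    the real MTUT loss adds focal regularization, which this sketch omits.
    """

    def corr(feat):
        n, c = feat.shape[:2]
        f = F.normalize(feat.reshape(n, c, -1), dim=2)
        return f @ f.transpose(1, 2)  # (N, C, C) channel correlations

    return (corr(feat_rgb) - corr(feat_depth)).pow(2).mean()


loss = ssa_like_loss(torch.randn(2, 64, 8, 7, 7), torch.randn(2, 64, 8, 7, 7))
```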
@@ -0,0 +1,60 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html">MTUT (CVPR'2019)</a></summary>

```bibtex
@InProceedings{Abavisani_2019_CVPR,
author = {Abavisani, Mahdi and Joze, Hamid Reza Vaezi and Patel, Vishal M.},
title = {Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
```

</details>

<!-- [BACKBONE] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2017/html/Carreira_Quo_Vadis_Action_CVPR_2017_paper.html">I3D (CVPR'2017)</a></summary>

```bibtex
@InProceedings{Carreira_2017_CVPR,
author = {Carreira, Joao and Zisserman, Andrew},
title = {Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2016/html/Molchanov_Online_Detection_and_CVPR_2016_paper.html">NVGesture (CVPR'2016)</a></summary>

```bibtex
@InProceedings{Molchanov_2016_CVPR,
author = {Molchanov, Pavlo and Yang, Xiaodong and Gupta, Shalini and Kim, Kihwan and Tyree, Stephen and Kautz, Jan},
title = {Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}
```

</details>

Results on NVGesture test set

| Arch | Input Size | fps | bbox | AP_rgb | AP_depth | ckpt | log |
| :------------------------------------------------------ | :--------: | :-: | :-------: | :----: | :------: | :-----------------------------------------------------: | :----------------------------------------------------: |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py)$^\*$ | 112x112 | 15 | $\\surd$ | 0.725 | 0.730 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py) | 224x224 | 30 | $\\surd$ | 0.782 | 0.811 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pthh) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py) | 224x224 | 30 | $\\times$ | 0.739 | 0.809 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-20220530.log.json) |

$^\*$: MTUT supports multi-modal training and uni-modal testing. A model trained with this config can be used to recognize gestures in RGB videos with the [inference config](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15_rgb.py).
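
As a usage sketch (assuming the standard MMPose 0.x config/build API; these are not commands introduced by this commit), an RGB-only recognizer could be built from the inference config and loaded with the multi-modal weights roughly like this:

```python
from mmcv import Config
from mmcv.runner import load_checkpoint
from mmpose.models import build_posenet

# Load the RGB-only inference config and build the gesture recognizer.
cfg = Config.fromfile(
    'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15_rgb.py')
model = build_posenet(cfg.model)

# The weights were trained with both RGB and depth streams; only the RGB
# branch is used at test time (checkpoint URL taken from the table above).
load_checkpoint(
    model,
    'https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth',
    map_location='cpu')
model.eval()
```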
@@ -0,0 +1,49 @@
Collections:
- Name: MTUT
Paper:
Title: Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition
With Multimodal Training
URL: https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html
README: https://github.com/open-mmlab/mmpose/blob/master/docs/en/papers/algorithms/mtut.md
Models:
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py
In Collection: MTUT
Metadata:
Architecture: &id001
- MTUT
- I3D
Training Data: NVGesture
Name: mtut_i3d_nvgesture_bbox_112x112_fps15
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.73
AP rgb: 0.725
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py
In Collection: MTUT
Metadata:
Architecture: *id001
Training Data: NVGesture
Name: mtut_i3d_nvgesture_bbox_224x224_fps30
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.811
AP rgb: 0.782
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pthh
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py
In Collection: MTUT
Metadata:
Architecture: *id001
Training Data: NVGesture
Name: mtut_i3d_nvgesture_224x224_fps30
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.809
AP rgb: 0.739
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth
@@ -0,0 +1,128 @@
_base_ = [
'../../../../_base_/default_runtime.py',
'../../../../_base_/datasets/nvgesture.py'
]

checkpoint_config = dict(interval=5)
evaluation = dict(interval=5, metric='AP', save_best='AP_rgb')

optimizer = dict(
type='SGD',
lr=1e-2,
momentum=0.9,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', gamma=0.1, step=[30, 50])
total_epochs = 75
log_config = dict(interval=10)

custom_hooks_config = [dict(type='ModelSetEpochHook')]

model = dict(
type='GestureRecognizer',
modality=['rgb', 'depth'],
pretrained=dict(
rgb='https://github.com/hassony2/kinetics_i3d_pytorch/'
'raw/master/model/model_rgb.pth',
depth='https://github.com/hassony2/kinetics_i3d_pytorch/'
'raw/master/model/model_rgb.pth',
),
backbone=dict(
rgb=dict(
type='I3D',
in_channels=3,
expansion=1,
),
depth=dict(
type='I3D',
in_channels=1,
expansion=1,
),
),
cls_head=dict(
type='MultiModalSSAHead',
num_classes=25,
),
train_cfg=dict(
beta=2,
lambda_=5e-3,
ssa_start_epoch=61,
),
test_cfg=dict(),
)

data_cfg = dict(
video_size=[320, 240],
modality=['rgb', 'depth'],
)

train_pipeline = [
dict(type='LoadVideoFromFile'),
dict(type='ModalWiseChannelProcess'),
dict(type='CropValidClip'),
dict(type='TemporalPooling', length=64, ref_fps=30),
dict(type='ResizeGivenShortEdge', length=256),
dict(type='RandomAlignedSpatialCrop', length=224),
dict(type='GestureRandomFlip'),
dict(type='MultiModalVideoToTensor'),
dict(
type='VideoNormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='Collect', keys=['video', 'label'], meta_keys=['fps',
'modality']),
]

val_pipeline = [
dict(type='LoadVideoFromFile'),
dict(type='ModalWiseChannelProcess'),
dict(type='CropValidClip'),
dict(type='TemporalPooling', length=-1, ref_fps=30),
dict(type='ResizeGivenShortEdge', length=256),
dict(type='CenterSpatialCrop', length=224),
dict(type='MultiModalVideoToTensor'),
dict(
type='VideoNormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='Collect', keys=['video', 'label'], meta_keys=['fps',
'modality']),
]

test_pipeline = val_pipeline

data_root = 'data/nvgesture'
data = dict(
samples_per_gpu=6,
workers_per_gpu=2,
val_dataloader=dict(samples_per_gpu=6),
test_dataloader=dict(samples_per_gpu=6),
train=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_train_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=train_pipeline,
dataset_info={{_base_.dataset_info}}),
val=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_test_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=val_pipeline,
test_mode=True,
dataset_info={{_base_.dataset_info}}),
test=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_test_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=test_pipeline,
test_mode=True,
dataset_info={{_base_.dataset_info}}))
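
The `ModelSetEpochHook` registered in `custom_hooks_config` presumably informs the model of the current epoch so the SSA term can be enabled once `ssa_start_epoch=61` is reached. A hypothetical sketch of such a hook (the class and method names below are assumptions, not the code added in this commit):

```python
from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class SetEpochHookSketch(Hook):
    """Hypothetical: pass the current epoch to the model each training epoch.

    The MTUT head could compare this value with its ``ssa_start_epoch``
    setting to decide when to switch on the SSA loss.
    """

    def before_train_epoch(self, runner):
        model = runner.model
        # Unwrap (Distributed)DataParallel if present.
        if hasattr(model, 'module'):
            model = model.module
        # ``set_train_epoch`` is an assumed interface on the recognizer.
        model.set_train_epoch(runner.epoch + 1)
```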