[Feature] Gesture recognition algorithm MTUT on NVGesture dataset (#1380)

* add nvgesture dataset

* fix nvgesture pipelines

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* fix a bug in demo (#1373)

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* solve conflict in mmdet_modelzoo.md

* add gesture recognition into webcam

* hand gesture inference config explanation

* add gesture recognizer node in __init__.py

* add gesture webcam readme

* Adjust inference tracking min keypoints (#1398)

* Adjust inference tracking min keypoints

* Special case for min_keypoints <= 0 doesn't seem to be required

* remove unnecessary transformer utils (#1405)

* add nvgesture dataset

* fix nvgesture pipelines

* update gesture datasets

* add ModelSetEpochHook

* nvgesture dataset support multi-GPU evaluation

* add i3d+mtut model

* add nvgesture i3d configs

* webcam add hand detector

* gesture recognition with bbox

* add hand detector config

* fix gesture recognizer init bug

* webcam/gesture - recognizer runs successfully

* delete unnecessary comment

* fix lint error in gesture configs

* add nvgesture category info

* webcam/gesture - display gesture recognition result

* add gesture recognition related docs

* update gesture related comments

* update light hand det model in demo doc

* update gesture recognition configs and results

* auto modify model-index.yml

* stabilize ssa loss in mtut

* add multi-input node comment

* synchronize tools/webcam with master

* add gesture task-name mapping

* move gesture configs to configs/hand/

* fix grammar errors in docs

* fix a lint error in doc

* update nvgesture evaluation

* add introduction and assertion to TemporalPooling

* generalize NVGestureRandomFlip

* add unittests for gesture pipelines

* delete duplicated config

* add gesture inference unittest

* add gesture dataset unittest

* fix gesture inference unittest error

* add backbone I3D unittest

* add mtut head unittest

* fix mtut head unittest error

* add gesture recognizer unittest

Co-authored-by: Yining Li <liyining0712@gmail.com>
Co-authored-by: Philipp Allgeuer <5592992+pallgeuer@users.noreply.github.com>
3 people authored Jun 2, 2022
1 parent 39f9dc8 commit d3c17d5
Showing 52 changed files with 3,692 additions and 31 deletions.
1 change: 1 addition & 0 deletions .dev_scripts/github/update_model_index.py
@@ -151,6 +151,7 @@ def parse_config_path(path):
'3d_kpt_mview_rgb_img': '3D Keypoint',
'3d_kpt_sview_rgb_vid': '3D Keypoint',
'3d_mesh_sview_rgb_img': '3D Mesh',
'gesture_sview_rgbd_vid': 'Gesture',
None: None
}
task_readable = task2readable.get(task)
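
Below is a minimal, hypothetical sketch (not part of this diff) of how such a task-to-readable mapping is typically consumed when deriving model-index metadata from a config path; the helper name and path parsing are illustrative and may differ from the real `parse_config_path`:

```python
# Hypothetical illustration: map the task token in a config path to a
# readable task name. The actual parse_config_path() logic may differ.
task2readable = {
    '3d_kpt_mview_rgb_img': '3D Keypoint',
    '3d_kpt_sview_rgb_vid': '3D Keypoint',
    '3d_mesh_sview_rgb_img': '3D Mesh',
    'gesture_sview_rgbd_vid': 'Gesture',
    None: None,
}


def readable_task_from_path(path):
    # e.g. 'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/xxx.py'
    parts = path.split('/')
    task = parts[2] if len(parts) > 2 else None
    return task2readable.get(task)


assert readable_task_from_path(
    'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15.py') == 'Gesture'
```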
42 changes: 42 additions & 0 deletions configs/_base_/datasets/nvgesture.py
@@ -0,0 +1,42 @@
dataset_info = dict(
dataset_name='nvgesture',
paper_info=dict(
author='Pavlo Molchanov and Xiaodong Yang and Shalini Gupta '
'and Kihwan Kim and Stephen Tyree and Jan Kautz',
title='Online Detection and Classification of Dynamic Hand Gestures '
'with Recurrent 3D Convolutional Neural Networks',
container='Proceedings of the IEEE Conference on '
'Computer Vision and Pattern Recognition',
year='2016',
homepage='https://research.nvidia.com/publication/2016-06_online-'
'detection-and-classification-dynamic-hand-gestures-recurrent-3d',
),
category_info={
0: 'five fingers move right',
1: 'five fingers move left',
2: 'five fingers move up',
3: 'five fingers move down',
4: 'two fingers move right',
5: 'two fingers move left',
6: 'two fingers move up',
7: 'two fingers move down',
8: 'click',
9: 'beckoned',
10: 'stretch hand',
11: 'shake hand',
12: 'one',
13: 'two',
14: 'three',
15: 'lift up',
16: 'press down',
17: 'push',
18: 'shrink',
19: 'levorotation',
20: 'dextrorotation',
21: 'two fingers prod',
22: 'grab',
23: 'thumbs up',
24: 'OK'
},
flip_pairs=[(0, 1), (4, 5), (19, 20)],
fps=30)
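
For illustration only (not part of this diff): the `flip_pairs` above pair directional gestures whose labels should swap when a clip is mirrored horizontally. A minimal sketch of such a remapping, which may differ from MMPose's actual `GestureRandomFlip` transform:

```python
# Hypothetical label remapping on horizontal flip: directional gestures such as
# 'five fingers move right' (0) and 'five fingers move left' (1) swap labels,
# while symmetric gestures keep theirs.
flip_pairs = [(0, 1), (4, 5), (19, 20)]
flip_map = {}
for a, b in flip_pairs:
    flip_map[a] = b
    flip_map[b] = a


def flip_label(label):
    return flip_map.get(label, label)


assert flip_label(0) == 1   # 'five fingers move right' -> 'move left'
assert flip_label(8) == 8   # 'click' is unaffected by mirroring
```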
7 changes: 7 additions & 0 deletions configs/hand/gesture_sview_rgbd_vid/README.md
@@ -0,0 +1,7 @@
# Gesture Recognition

Gesture recognition aims to recognize hand gestures in videos, such as a thumbs-up.

## Data preparation

Please follow [DATA Preparation](/docs/en/tasks/2d_hand_gesture.md) to prepare data.
8 changes: 8 additions & 0 deletions configs/hand/gesture_sview_rgbd_vid/mtut/README.md
@@ -0,0 +1,8 @@
# Multi-modal Training and Uni-modal Testing (MTUT) for gesture recognition

The MTUT method uses multi-modal data, such as RGB and depth videos, in the training phase.
For each modality, an I3D network is trained to perform gesture recognition. The spatial-temporal
semantic alignment across modalities is used as an additional supervision signal to improve the
performance of each single-modality I3D network.

In the testing phase, uni-modal data, typically an RGB video, is used.
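
As a rough illustration of the alignment idea, the sketch below pushes channel-correlation matrices of the RGB and depth I3D features to agree. This is a simplified stand-in for the SSA loss, not the exact implementation in the paper or in `MultiModalSSAHead` (which additionally uses a focal regularization term to avoid negative transfer between streams):

```python
import torch
import torch.nn.functional as F


def ssa_like_loss(feat_rgb, feat_depth):
    """Toy spatial-temporal semantic alignment loss.

    feat_*: (N, C, T, H, W) features from the RGB and depth I3D streams.
    Channel-correlation matrices are pushed to agree across modalities;
    the real MTUT loss adds focal regularization, which this sketch omits.
    """

    def corr(feat):
        n, c = feat.shape[:2]
        f = F.normalize(feat.reshape(n, c, -1), dim=2)
        return f @ f.transpose(1, 2)  # (N, C, C) channel correlations

    return (corr(feat_rgb) - corr(feat_depth)).pow(2).mean()


loss = ssa_like_loss(torch.randn(2, 64, 8, 7, 7), torch.randn(2, 64, 8, 7, 7))
```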
@@ -0,0 +1,60 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html">MTUT (CVPR'2019)</a></summary>

```bibtex
@InProceedings{Abavisani_2019_CVPR,
author = {Abavisani, Mahdi and Joze, Hamid Reza Vaezi and Patel, Vishal M.},
title = {Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
```

</details>

<!-- [BACKBONE] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2017/html/Carreira_Quo_Vadis_Action_CVPR_2017_paper.html">I3D (CVPR'2017)</a></summary>

```bibtex
@InProceedings{Carreira_2017_CVPR,
author = {Carreira, Joao and Zisserman, Andrew},
title = {Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content_cvpr_2016/html/Molchanov_Online_Detection_and_CVPR_2016_paper.html">NVGesture (CVPR'2016)</a></summary>

```bibtex
@InProceedings{Molchanov_2016_CVPR,
author = {Molchanov, Pavlo and Yang, Xiaodong and Gupta, Shalini and Kim, Kihwan and Tyree, Stephen and Kautz, Jan},
title = {Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}
```

</details>

Results on NVGesture test set

| Arch | Input Size | fps | bbox | AP_rgb | AP_depth | ckpt | log |
| :------------------------------------------------------ | :--------: | :-: | :-------: | :----: | :------: | :-----------------------------------------------------: | :----------------------------------------------------: |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py)$^\*$ | 112x112 | 15 | $\\surd$ | 0.725 | 0.730 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py) | 224x224 | 30 | $\\surd$ | 0.782 | 0.811 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pthh) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-20220530.log.json) |
| [I3D+MTUT](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py) | 224x224 | 30 | $\\times$ | 0.739 | 0.809 | [ckpt](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth) | [log](https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-20220530.log.json) |

$^\*$: MTUT supports multi-modal training and uni-modal testing. A model trained with this config can be used to recognize gestures in RGB videos with the [inference config](/configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15_rgb.py).
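
As a usage sketch (assuming the standard MMPose 0.x config/build API; these are not commands introduced by this commit), an RGB-only recognizer could be built from the inference config and loaded with the multi-modal weights roughly like this:

```python
from mmcv import Config
from mmcv.runner import load_checkpoint
from mmpose.models import build_posenet

# Load the RGB-only inference config and build the gesture recognizer.
cfg = Config.fromfile(
    'configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15_rgb.py')
model = build_posenet(cfg.model)

# The weights were trained with both RGB and depth streams; only the RGB
# branch is used at test time (checkpoint URL taken from the table above).
load_checkpoint(
    model,
    'https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/'
    'i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth',
    map_location='cpu')
model.eval()
```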
@@ -0,0 +1,49 @@
Collections:
- Name: MTUT
Paper:
Title: Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition
With Multimodal Training
URL: https://openaccess.thecvf.com/content_CVPR_2019/html/Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.html
README: https://github.com/open-mmlab/mmpose/blob/master/docs/en/papers/algorithms/mtut.md
Models:
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_112x112_fps15.py
In Collection: MTUT
Metadata:
Architecture: &id001
- MTUT
- I3D
Training Data: NVGesture
Name: mtut_i3d_nvgesture_bbox_112x112_fps15
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.73
AP rgb: 0.725
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_112x112_fps15-363b5956_20220530.pth
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_bbox_224x224_fps30.py
In Collection: MTUT
Metadata:
Architecture: *id001
Training Data: NVGesture
Name: mtut_i3d_nvgesture_bbox_224x224_fps30
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.811
AP rgb: 0.782
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_bbox_224x224_fps30-98a8f288_20220530.pthh
- Config: configs/hand/gesture_sview_rgbd_vid/mtut/nvgesture/i3d_nvgesture_224x224_fps30.py
In Collection: MTUT
Metadata:
Architecture: *id001
Training Data: NVGesture
Name: mtut_i3d_nvgesture_224x224_fps30
Results:
- Dataset: NVGesture
Metrics:
AP depth: 0.809
AP rgb: 0.739
Task: Hand Gesture
Weights: https://download.openmmlab.com/mmpose/gesture/mtut/i3d_nvgesture/i3d_nvgesture_224x224_fps30-b7abf574_20220530.pth
@@ -0,0 +1,128 @@
_base_ = [
'../../../../_base_/default_runtime.py',
'../../../../_base_/datasets/nvgesture.py'
]

checkpoint_config = dict(interval=5)
evaluation = dict(interval=5, metric='AP', save_best='AP_rgb')

optimizer = dict(
type='SGD',
lr=1e-2,
momentum=0.9,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', gamma=0.1, step=[30, 50])
total_epochs = 75
log_config = dict(interval=10)

custom_hooks_config = [dict(type='ModelSetEpochHook')]

model = dict(
type='GestureRecognizer',
modality=['rgb', 'depth'],
pretrained=dict(
rgb='https://github.com/hassony2/kinetics_i3d_pytorch/'
'raw/master/model/model_rgb.pth',
depth='https://github.com/hassony2/kinetics_i3d_pytorch/'
'raw/master/model/model_rgb.pth',
),
backbone=dict(
rgb=dict(
type='I3D',
in_channels=3,
expansion=1,
),
depth=dict(
type='I3D',
in_channels=1,
expansion=1,
),
),
cls_head=dict(
type='MultiModalSSAHead',
num_classes=25,
),
train_cfg=dict(
beta=2,
lambda_=5e-3,
ssa_start_epoch=61,
),
test_cfg=dict(),
)

data_cfg = dict(
video_size=[320, 240],
modality=['rgb', 'depth'],
)

train_pipeline = [
dict(type='LoadVideoFromFile'),
dict(type='ModalWiseChannelProcess'),
dict(type='CropValidClip'),
dict(type='TemporalPooling', length=64, ref_fps=30),
dict(type='ResizeGivenShortEdge', length=256),
dict(type='RandomAlignedSpatialCrop', length=224),
dict(type='GestureRandomFlip'),
dict(type='MultiModalVideoToTensor'),
dict(
type='VideoNormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='Collect', keys=['video', 'label'], meta_keys=['fps',
'modality']),
]

val_pipeline = [
dict(type='LoadVideoFromFile'),
dict(type='ModalWiseChannelProcess'),
dict(type='CropValidClip'),
dict(type='TemporalPooling', length=-1, ref_fps=30),
dict(type='ResizeGivenShortEdge', length=256),
dict(type='CenterSpatialCrop', length=224),
dict(type='MultiModalVideoToTensor'),
dict(
type='VideoNormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='Collect', keys=['video', 'label'], meta_keys=['fps',
'modality']),
]

test_pipeline = val_pipeline

data_root = 'data/nvgesture'
data = dict(
samples_per_gpu=6,
workers_per_gpu=2,
val_dataloader=dict(samples_per_gpu=6),
test_dataloader=dict(samples_per_gpu=6),
train=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_train_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=train_pipeline,
dataset_info={{_base_.dataset_info}}),
val=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_test_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=val_pipeline,
test_mode=True,
dataset_info={{_base_.dataset_info}}),
test=dict(
type='NVGestureDataset',
ann_file=f'{data_root}/annotations/'
'nvgesture_test_correct_cvpr2016_v2.lst',
vid_prefix=f'{data_root}/',
data_cfg=data_cfg,
pipeline=test_pipeline,
test_mode=True,
dataset_info={{_base_.dataset_info}}))
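
The `ModelSetEpochHook` registered in `custom_hooks_config` presumably informs the model of the current epoch so the SSA term can be enabled once `ssa_start_epoch=61` is reached. A hypothetical sketch of such a hook (the class and method names below are assumptions, not the code added in this commit):

```python
from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class SetEpochHookSketch(Hook):
    """Hypothetical: pass the current epoch to the model each training epoch.

    The MTUT head could compare this value with its ``ssa_start_epoch``
    setting to decide when to switch on the SSA loss.
    """

    def before_train_epoch(self, runner):
        model = runner.model
        # Unwrap (Distributed)DataParallel if present.
        if hasattr(model, 'module'):
            model = model.module
        # ``set_train_epoch`` is an assumed interface on the recognizer.
        model.set_train_epoch(runner.epoch + 1)
```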