This is an online action segmentation network for 16 classes, trained on an Intel dataset. It is an online version of MSTCN++. The difference between the two is that online MSTCN++ accepts streaming video as input, while MSTCN++ assumes the whole video is given.
For details of the original MSTCN++ model, see the paper.
Metric | Value |
---|---|
GOPs | 0.048915 |
MParams | 1.018179 |
Source framework | PyTorch* |
Accuracy | Metric | noise/background | remove_support_sleeve | adjust_rider | adjust_nut | adjust_balancing | open_box | close_box | choose_weight | put_left | put_right | take_left | take_right | install support_sleeve | mean | mPR (P+R)/2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
frame-level | precision | 0.22 | 0.84 | 0.81 | 0.62 | 0.67 | 0.87 | 0.56 | 0.52 | 0.54 | 0.74 | 0.62 | 0.68 | 0.86 | 0.66 | 0.66 |
frame-level | recall | 0.4 | 0.95 | 0.83 | 0.86 | 0.43 | 0.8 | 0.31 | 0.52 | 0.68 | 0.65 | 0.62 | 0.51 | 0.92 | 0.65 | |
segment IOU | precision | 0.38 | 0.94 | 0.77 | 0.65 | 0.6 | 0.85 | 0.56 | 0.68 | 0.74 | 0.88 | 0.72 | 0.78 | 0.69 | 0.7 | 0.77 |
segment IOU | recall | 0.64 | 1 | 0.96 | 0.94 | 0.62 | 0.96 | 0.48 | 0.77 | 0.91 | 0.88 | 0.83 | 0.85 | 1 | 0.83 | |
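
As a rough cross-check of the last two columns, the sketch below recomputes the frame-level means and mPR from the per-class values in the accuracy table. It assumes `mean` is the unweighted average over the 13 listed classes, which the table does not state explicitly. Because the per-class values are already rounded, the results can only be expected to match the reported figures to about two decimals.

```python
# Per-class frame-level values copied from the accuracy table.
frame_precision = [0.22, 0.84, 0.81, 0.62, 0.67, 0.87, 0.56,
                   0.52, 0.54, 0.74, 0.62, 0.68, 0.86]
frame_recall = [0.4, 0.95, 0.83, 0.86, 0.43, 0.8, 0.31,
                0.52, 0.68, 0.65, 0.62, 0.51, 0.92]


def mean(values):
    """Unweighted average (an assumption about how `mean` was computed)."""
    return sum(values) / len(values)


mean_p = mean(frame_precision)
mean_r = mean(frame_recall)
# mPR as defined in the table header: (P + R) / 2.
mpr = (mean_p + mean_r) / 2

print(f"mean precision {mean_p:.3f}, mean recall {mean_r:.3f}, mPR {mpr:.3f}")
```

The same calculation can be repeated for the segment IOU rows by swapping in their values.
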
Note: The accuracy report uses i3d-rgb as the feature extraction network. You can get this model from ../../public/i3d-rgb-tf/README.md.
The network takes two kinds of input: feature vectors for each video frame, which should be the output of a feature extraction network such as i3d-rgb-tf or resnet-50-tf, and the feature outputs of the previous frame.
You can check the usage of i3d-rgb and smartlab-sequence-modelling-0001 in demos/smartlab_demo.
- Input feature, name: `input`, shape: `1, 2048, 24`, format: `B, W, H`, where:
  - `B` - batch size
  - `W` - feature map width
  - `H` - feature map height
- History feature 1, name: `fhis_in_0`, shape: `12, 64, 2048`, format: `C, H', W`
- History feature 2, name: `fhis_in_1`, shape: `11, 64, 2048`, format: `C, H', W`
- History feature 3, name: `fhis_in_2`, shape: `11, 64, 2048`, format: `C, H', W`
- History feature 4, name: `fhis_in_3`, shape: `11, 64, 2048`, format: `C, H', W`, where:
  - `C` - the channel number of feature vector
  - `H'` - feature map height
  - `W` - feature map width
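
The input shapes listed above can be collected in one place to estimate how much state an application has to keep between calls. This is a minimal sketch: the `INPUT_SHAPES` dictionary and the `element_count` helper are illustrative names, and the 4-byte element size assumes FP32 tensors, which this description does not state.

```python
from math import prod

# Input names and shapes as listed in the inputs section.
INPUT_SHAPES = {
    "input": (1, 2048, 24),
    "fhis_in_0": (12, 64, 2048),
    "fhis_in_1": (11, 64, 2048),
    "fhis_in_2": (11, 64, 2048),
    "fhis_in_3": (11, 64, 2048),
}

BYTES_PER_ELEMENT = 4  # assumption: FP32 tensors


def element_count(shape):
    """Number of scalar values in a tensor of the given shape."""
    return prod(shape)


# Total size of the four history tensors carried from one call to the next.
history_elements = sum(
    element_count(shape)
    for name, shape in INPUT_SHAPES.items()
    if name.startswith("fhis_in_")
)
history_mib = history_elements * BYTES_PER_ELEMENT / (1024 * 1024)

print(f"history state: {history_elements} values, {history_mib:.1f} MiB as FP32")
```

Under that assumption, the four history tensors are far larger than the `input` tensor itself, so it may be worth passing them between calls without extra copies.
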
The outputs also consist of two parts: predictions and four history feature outputs. The predictions are the action classification results. The four feature maps are the model's layer features from past frames.
- Prediction, name: `output`, shape: `4, 1, 64, 24`, format: `C, B, H, W`, where:
  - `C` - the channel number of feature vector
  - `B` - batch size
  - `H` - feature map height
  - `W` - feature map width

  After post-processing with the `argmax()` function, the prediction result can be used to decide the action type of the current frame.
- History feature 1, name: `fhis_out_0`, shape: `12, 64, 2048`, format: `C, H, W`
- History feature 2, name: `fhis_out_1`, shape: `11, 64, 2048`, format: `C, H, W`
- History feature 3, name: `fhis_out_2`, shape: `11, 64, 2048`, format: `C, H, W`
- History feature 4, name: `fhis_out_3`, shape: `11, 64, 2048`, format: `C, H, W`, where:
  - `C` - the channel number of feature vector
  - `H` - feature map height
  - `W` - feature map width
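
Because the history outputs mirror the history inputs in name and shape, online inference can be organized as a loop that feeds each `fhis_out_*` tensor back as the matching `fhis_in_*` on the next call. The sketch below shows only that bookkeeping. `run_stream`, `fake_infer`, and `argmax` are hypothetical helpers, and `fake_infer` is a toy stand-in for a real inference call that uses plain numbers instead of tensors. How the history is initialized for the first call, and which axis of `output` holds the class scores, are not stated in this description; see demos/smartlab_demo for the actual usage.

```python
# Pairs of (input name, output name) for the four history features.
HISTORY_PAIRS = [(f"fhis_in_{i}", f"fhis_out_{i}") for i in range(4)]


def run_stream(infer, feature_chunks, initial_history):
    """Run `infer` on each chunk, carrying history outputs into the next call.

    `infer` is any callable mapping {input name: value} to {output name: value}.
    """
    history = dict(initial_history)  # keys fhis_in_0 .. fhis_in_3
    predictions = []
    for chunk in feature_chunks:
        result = infer({"input": chunk, **history})
        predictions.append(result["output"])
        # Feed each history output back as the matching input of the next call.
        history = {name_in: result[name_out] for name_in, name_out in HISTORY_PAIRS}
    return predictions


def argmax(scores):
    """Index of the largest score in a flat sequence."""
    return max(range(len(scores)), key=lambda i: scores[i])


def fake_infer(feed):
    """Toy stand-in for the model: counts calls through the history values."""
    step = feed["fhis_in_0"] + 1
    result = {name_out: step for _, name_out in HISTORY_PAIRS}
    # Toy scores for 16 classes; the peak moves with each call.
    result["output"] = [1.0 if i == step else 0.0 for i in range(16)]
    return result


start = {name_in: 0 for name_in, _ in HISTORY_PAIRS}
outputs = run_stream(fake_infer, ["chunk_a", "chunk_b", "chunk_c"], start)
labels = [argmax(scores) for scores in outputs]
print(labels)
```

With a real model, `infer` would wrap the inference request, the chunks would be feature tensors from the extraction network, and `argmax` would be applied along the class axis of `output`.
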
[*] Other names and brands may be claimed as the property of others.