
Commit 56cef36

Update handpose estimation model from MediaPipe (2023feb) (#133)
* update handpose model
* update quantize model
* fix quantize path
* update readme of quantization and benchmark result
* fix document
1 parent d23727e · commit 56cef36

13 files changed: +217 −63 lines

README.md

Lines changed: 2 additions & 2 deletions
@@ -35,7 +35,7 @@ Guidelines:
 | [DaSiamRPN](./models/object_tracking_dasiamrpn) | Object Tracking | 1280x720 | 36.15 | 705.48 | 76.82 | --- | --- |
 | [YoutuReID](./models/person_reid_youtureid) | Person Re-Identification | 128x256 | 35.81 | 521.98 | 90.07 | 44.61 | --- |
 | [MP-PalmDet](./models/palm_detection_mediapipe) | Palm Detection | 192x192 | 11.09 | 63.79 | 83.20 | 33.81 | --- |
-| [MP-HandPose](./models/handpose_estimation_mediapipe) | Hand Pose Estimation | 256x256 | 20.16 | 148.24 | 156.30 | 42.70 | --- |
+| [MP-HandPose](./models/handpose_estimation_mediapipe) | Hand Pose Estimation | 224x224 | 4.28 | 36.19 | 40.10 | 19.47 | --- |

 \*: Models are quantized in per-channel mode, which run slower than per-tensor quantized models on NPU.

@@ -91,7 +91,7 @@ Some examples are listed below. You can find more in the directory of each model

 ### Hand Pose Estimation with [MP-HandPose](models/handpose_estimation_mediapipe/)

-![handpose estimation](models/handpose_estimation_mediapipe/examples/mphandpose_demo.gif)
+![handpose estimation](models/handpose_estimation_mediapipe/examples/mphandpose_demo.webp)

 ### QR Code Detection and Parsing with [WeChatQRCode](./models/qrcode_wechatqrcode/)

benchmark/config/handpose_estimation_mediapipe.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Benchmark:
   path: "data/palm_detection_20230125"
   files: ["palm1.jpg", "palm2.jpg", "palm3.jpg"]
   sizes: # [[w1, h1], ...], Omit to run at original scale
-    - [256, 256]
+    - [224, 224]
   metric:
     warmup: 30
     repeat: 10
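For a quick sanity check that the benchmark will now feed 224x224 inputs, the config can be loaded with PyYAML. This is only a sketch: it assumes PyYAML is installed, and the benchmark framework itself has its own loader.

```python
# Minimal sketch: load and inspect the benchmark config with PyYAML.
# The benchmark harness in this repo parses the file internally; this is only for inspection.
import yaml

with open("benchmark/config/handpose_estimation_mediapipe.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg)  # after this commit, the configured input size should read [224, 224]
```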

models/handpose_estimation_mediapipe/README.md

Lines changed: 9 additions & 4 deletions
@@ -4,11 +4,14 @@ This model estimates 21 hand keypoints per detected hand from [palm detector](..

 ![MediaPipe Hands Keypoints](./examples/hand_keypoints.png)

-This model is converted from Tensorflow-JS to ONNX using the following tools:
-- tfjs to tf_saved_model: https://github.com/patlevin/tfjs-to-tf/
-- tf_saved_model to ONNX: https://github.com/onnx/tensorflow-onnx
+This model is converted from TFLite to ONNX using the following tools:
+- TFLite model to ONNX: https://github.com/onnx/tensorflow-onnx
 - simplified by [onnx-simplifier](https://github.com/daquexian/onnx-simplifier)

+**Note**:
+- The int8-quantized model may produce invalid results due to a significant drop of accuracy.
+- Visit https://google.github.io/mediapipe/solutions/models.html#hands for models of larger scale.
+
 ## Demo

 Run the following commands to try the demo:

@@ -21,7 +24,7 @@ python demo.py -i /path/to/image

 ### Example outputs

-![webcam demo](./examples/mphandpose_demo.gif)
+![webcam demo](./examples/mphandpose_demo.webp)

 ## License

@@ -30,3 +33,5 @@ All files in this directory are licensed under [Apache 2.0 License](./LICENSE).
 ## Reference

 - MediaPipe Handpose: https://github.com/tensorflow/tfjs-models/tree/master/handpose
+- MediaPipe hands model and model card: https://google.github.io/mediapipe/solutions/models.html#hands
+- Int8 model quantized with the RGB evaluation set of FreiHAND: https://lmb.informatik.uni-freiburg.de/resources/datasets/FreihandDataset.en.html
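The README change above describes the new conversion chain: TFLite to ONNX via tf2onnx, then graph simplification with onnx-simplifier. The following is a minimal sketch of that chain, assuming both packages are installed; the input/output file names and the source TFLite file are illustrative, not taken from this commit.

```python
# Minimal sketch of the conversion chain referenced in the README:
# TFLite -> ONNX with tf2onnx, then simplification with onnx-simplifier.
# File names below are assumptions for illustration only.
import onnx
from tf2onnx import convert
from onnxsim import simplify

# 1) Convert the (separately downloaded) MediaPipe hand landmark TFLite model to ONNX.
model_proto, _ = convert.from_tflite(
    "hand_landmark.tflite",                              # assumed local path
    output_path="handpose_estimation_mediapipe.onnx",    # assumed output name
)

# 2) Simplify the exported graph.
simplified, ok = simplify(model_proto)
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(simplified, "handpose_estimation_mediapipe_simplified.onnx")
```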

models/handpose_estimation_mediapipe/demo.py

Lines changed: 101 additions & 39 deletions
@@ -31,69 +31,126 @@ def str2bool(v):

 parser = argparse.ArgumentParser(description='Hand Pose Estimation from MediaPipe')
 parser.add_argument('--input', '-i', type=str, help='Path to the input image. Omit for using default camera.')
-parser.add_argument('--model', '-m', type=str, default='./handpose_estimation_mediapipe_2022may.onnx', help='Path to the model.')
+parser.add_argument('--model', '-m', type=str, default='./handpose_estimation_mediapipe_2023feb.onnx', help='Path to the model.')
 parser.add_argument('--backend', '-b', type=int, default=backends[0], help=help_msg_backends.format(*backends))
 parser.add_argument('--target', '-t', type=int, default=targets[0], help=help_msg_targets.format(*targets))
-parser.add_argument('--conf_threshold', type=float, default=0.8, help='Filter out hands of confidence < conf_threshold.')
+parser.add_argument('--conf_threshold', type=float, default=0.9, help='Filter out hands of confidence < conf_threshold.')
 parser.add_argument('--save', '-s', type=str, default=False, help='Set true to save results. This flag is invalid when using camera.')
 parser.add_argument('--vis', '-v', type=str2bool, default=True, help='Set true to open a window for result visualization. This flag is invalid when using camera.')
 args = parser.parse_args()


 def visualize(image, hands, print_result=False):
-    output = image.copy()
+    display_screen = image.copy()
+    display_3d = np.zeros((400, 400, 3), np.uint8)
+    cv.line(display_3d, (200, 0), (200, 400), (255, 255, 255), 2)
+    cv.line(display_3d, (0, 200), (400, 200), (255, 255, 255), 2)
+    cv.putText(display_3d, 'Main View', (0, 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Top View', (200, 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Left View', (0, 212), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Right View', (200, 212), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    is_draw = False # ensure only one hand is drawn
+
+    def draw_lines(image, landmarks, is_draw_point=True, thickness=2):
+        cv.line(image, landmarks[0], landmarks[1], (255, 255, 255), thickness)
+        cv.line(image, landmarks[1], landmarks[2], (255, 255, 255), thickness)
+        cv.line(image, landmarks[2], landmarks[3], (255, 255, 255), thickness)
+        cv.line(image, landmarks[3], landmarks[4], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[5], (255, 255, 255), thickness)
+        cv.line(image, landmarks[5], landmarks[6], (255, 255, 255), thickness)
+        cv.line(image, landmarks[6], landmarks[7], (255, 255, 255), thickness)
+        cv.line(image, landmarks[7], landmarks[8], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[9], (255, 255, 255), thickness)
+        cv.line(image, landmarks[9], landmarks[10], (255, 255, 255), thickness)
+        cv.line(image, landmarks[10], landmarks[11], (255, 255, 255), thickness)
+        cv.line(image, landmarks[11], landmarks[12], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[13], (255, 255, 255), thickness)
+        cv.line(image, landmarks[13], landmarks[14], (255, 255, 255), thickness)
+        cv.line(image, landmarks[14], landmarks[15], (255, 255, 255), thickness)
+        cv.line(image, landmarks[15], landmarks[16], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[17], (255, 255, 255), thickness)
+        cv.line(image, landmarks[17], landmarks[18], (255, 255, 255), thickness)
+        cv.line(image, landmarks[18], landmarks[19], (255, 255, 255), thickness)
+        cv.line(image, landmarks[19], landmarks[20], (255, 255, 255), thickness)
+
+        if is_draw_point:
+            for p in landmarks:
+                cv.circle(image, p, thickness, (0, 0, 255), -1)

     for idx, handpose in enumerate(hands):
         conf = handpose[-1]
         bbox = handpose[0:4].astype(np.int32)
-        landmarks = handpose[4:-1].reshape(21, 2).astype(np.int32)
+        handedness = handpose[-2]
+        if handedness <= 0.5:
+            handedness_text = 'Left'
+        else:
+            handedness_text = 'Right'
+        landmarks_screen = handpose[4:67].reshape(21, 3).astype(np.int32)
+        landmarks_word = handpose[67:130].reshape(21, 3)

         # Print results
         if print_result:
             print('-----------hand {}-----------'.format(idx + 1))
             print('conf: {:.2f}'.format(conf))
+            print('handedness: {}'.format(handedness_text))
             print('hand box: {}'.format(bbox))
             print('hand landmarks: ')
-            for l in landmarks:
+            for l in landmarks_screen:
+                print('\t{}'.format(l))
+            print('hand world landmarks: ')
+            for l in landmarks_word:
                 print('\t{}'.format(l))

+        # draw box
+        cv.rectangle(display_screen, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 0), 2)
+        # draw handedness
+        cv.putText(display_screen, '{}'.format(handedness_text), (bbox[0], bbox[1] + 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
         # Draw line between each key points
-        cv.line(output, landmarks[0], landmarks[1], (255, 255, 255), 2)
-        cv.line(output, landmarks[1], landmarks[2], (255, 255, 255), 2)
-        cv.line(output, landmarks[2], landmarks[3], (255, 255, 255), 2)
-        cv.line(output, landmarks[3], landmarks[4], (255, 255, 255), 2)
-
-        cv.line(output, landmarks[0], landmarks[5], (255, 255, 255), 2)
-        cv.line(output, landmarks[5], landmarks[6], (255, 255, 255), 2)
-        cv.line(output, landmarks[6], landmarks[7], (255, 255, 255), 2)
-        cv.line(output, landmarks[7], landmarks[8], (255, 255, 255), 2)
-
-        cv.line(output, landmarks[0], landmarks[9], (255, 255, 255), 2)
-        cv.line(output, landmarks[9], landmarks[10], (255, 255, 255), 2)
-        cv.line(output, landmarks[10], landmarks[11], (255, 255, 255), 2)
-        cv.line(output, landmarks[11], landmarks[12], (255, 255, 255), 2)
-
-        cv.line(output, landmarks[0], landmarks[13], (255, 255, 255), 2)
-        cv.line(output, landmarks[13], landmarks[14], (255, 255, 255), 2)
-        cv.line(output, landmarks[14], landmarks[15], (255, 255, 255), 2)
-        cv.line(output, landmarks[15], landmarks[16], (255, 255, 255), 2)
-
-        cv.line(output, landmarks[0], landmarks[17], (255, 255, 255), 2)
-        cv.line(output, landmarks[17], landmarks[18], (255, 255, 255), 2)
-        cv.line(output, landmarks[18], landmarks[19], (255, 255, 255), 2)
-        cv.line(output, landmarks[19], landmarks[20], (255, 255, 255), 2)
-
-        for p in landmarks:
-            cv.circle(output, p, 2, (0, 0, 255), 2)
-
-    return output
+        landmarks_xy = landmarks_screen[:, 0:2]
+        draw_lines(display_screen, landmarks_xy, is_draw_point=False)
+
+        # z value is relative to WRIST
+        for p in landmarks_screen:
+            r = max(5 - p[2] // 5, 0)
+            r = min(r, 14)
+            cv.circle(display_screen, np.array([p[0], p[1]]), r, (0, 0, 255), -1)
+
+        if is_draw is False:
+            is_draw = True
+            # Main view
+            landmarks_xy = landmarks_word[:, [0, 1]]
+            landmarks_xy = (landmarks_xy * 1000 + 100).astype(np.int32)
+            draw_lines(display_3d, landmarks_xy, thickness=5)
+
+            # Top view
+            landmarks_xz = landmarks_word[:, [0, 2]]
+            landmarks_xz[:, 1] = -landmarks_xz[:, 1]
+            landmarks_xz = (landmarks_xz * 1000 + np.array([300, 100])).astype(np.int32)
+            draw_lines(display_3d, landmarks_xz, thickness=5)
+
+            # Left view
+            landmarks_yz = landmarks_word[:, [2, 1]]
+            landmarks_yz[:, 0] = -landmarks_yz[:, 0]
+            landmarks_yz = (landmarks_yz * 1000 + np.array([100, 300])).astype(np.int32)
+            draw_lines(display_3d, landmarks_yz, thickness=5)
+
+            # Right view
+            landmarks_zy = landmarks_word[:, [2, 1]]
+            landmarks_zy = (landmarks_zy * 1000 + np.array([300, 300])).astype(np.int32)
+            draw_lines(display_3d, landmarks_zy, thickness=5)
+
+    return display_screen, display_3d


 if __name__ == '__main__':
     # palm detector
     palm_detector = MPPalmDet(modelPath='../palm_detection_mediapipe/palm_detection_mediapipe_2023feb.onnx',
                               nmsThreshold=0.3,
-                              scoreThreshold=0.8,
+                              scoreThreshold=0.6,
                               backendId=args.backend,
                               targetId=args.target)
     # handpose detector
@@ -108,7 +165,7 @@ def visualize(image, hands, print_result=False):

         # Palm detector inference
         palms = palm_detector.infer(image)
-        hands = np.empty(shape=(0, 47))
+        hands = np.empty(shape=(0, 132))

         # Estimate the pose of each hand
         for palm in palms:
@@ -117,10 +174,12 @@ def visualize(image, hands, print_result=False):
             if handpose is not None:
                 hands = np.vstack((hands, handpose))
         # Draw results on the input image
-        image = visualize(image, hands, True)
+        image, view_3d = visualize(image, hands, True)

         if len(palms) == 0:
             print('No palm detected!')
+        else:
+            print('Palm detected!')

         # Save results
         if args.save:
@@ -131,6 +190,7 @@ def visualize(image, hands, print_result=False):
         if args.vis:
             cv.namedWindow(args.input, cv.WINDOW_AUTOSIZE)
             cv.imshow(args.input, image)
+            cv.imshow('3D HandPose Demo', view_3d)
             cv.waitKey(0)
     else: # Omit input to call default camera
         deviceId = 0
@@ -145,7 +205,7 @@ def visualize(image, hands, print_result=False):

             # Palm detector inference
             palms = palm_detector.infer(frame)
-            hands = np.empty(shape=(0, 47))
+            hands = np.empty(shape=(0, 132))

             tm.start()
             # Estimate the pose of each hand
@@ -156,12 +216,14 @@ def visualize(image, hands, print_result=False):
                 hands = np.vstack((hands, handpose))
             tm.stop()
             # Draw results on the input image
-            frame = visualize(frame, hands)
+            frame, view_3d = visualize(frame, hands)

             if len(palms) == 0:
                 print('No palm detected!')
             else:
+                print('Palm detected!')
                 cv.putText(frame, 'FPS: {:.2f}'.format(tm.getFPS()), (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

             cv.imshow('MediaPipe Handpose Detection Demo', frame)
+            cv.imshow('3D HandPose Demo', view_3d)
             tm.reset()
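The demo above wires the palm detector and the handpose estimator into a two-stage pipeline. Below is a condensed sketch of that flow for a single image. Model paths, thresholds, and class names come from the diff above; the import path tweak for the palm detector package and the absence of backend/target arguments are assumptions, and visualization is omitted.

```python
# Condensed sketch of the two-stage pipeline in demo.py: palm detection, then handpose estimation.
# Import layout is assumed from the opencv_zoo folder structure; error handling is omitted.
import sys
import cv2 as cv
import numpy as np

from mp_handpose import MPHandPose
sys.path.append('../palm_detection_mediapipe')   # assumed path tweak so mp_palmdet resolves
from mp_palmdet import MPPalmDet

palm_detector = MPPalmDet(modelPath='../palm_detection_mediapipe/palm_detection_mediapipe_2023feb.onnx',
                          nmsThreshold=0.3, scoreThreshold=0.6)
handpose_detector = MPHandPose(modelPath='./handpose_estimation_mediapipe_2023feb.onnx',
                               confThreshold=0.9)

image = cv.imread('/path/to/image')
hands = np.empty(shape=(0, 132))                   # one 132-element row per detected hand
for palm in palm_detector.infer(image):            # stage 1: palm boxes
    handpose = handpose_detector.infer(image, palm)  # stage 2: 21 keypoints per palm
    if handpose is not None:
        hands = np.vstack((hands, handpose))
print('detected {} hand(s)'.format(hands.shape[0]))
```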

models/handpose_estimation_mediapipe/handpose_estimation_mediapipe_2022may-int8-quantized.onnx

Lines changed: 0 additions & 3 deletions
This file was deleted.

models/handpose_estimation_mediapipe/handpose_estimation_mediapipe_2022may.onnx

Lines changed: 0 additions & 3 deletions
This file was deleted.

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e97bc1fb83b641954d33424c82b6ade719d0f73250bdb91710ecfd5f7b47e321
+size 1167628

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:db0898ae717b76b075d9bf563af315b29562e11f8df5027a1ef07b02bef6d81c
+size 4099621

models/handpose_estimation_mediapipe/mp_handpose.py

Lines changed: 17 additions & 7 deletions
@@ -9,7 +9,7 @@ def __init__(self, modelPath, confThreshold=0.8, backendId=0, targetId=0):
         self.backend_id = backendId
         self.target_id = targetId

-        self.input_size = np.array([256, 256]) # wh
+        self.input_size = np.array([224, 224]) # wh
         self.PALM_LANDMARK_IDS = [0, 5, 9, 13, 17, 1, 2]
         self.PALM_LANDMARKS_INDEX_OF_PALM_BASE = 0
         self.PALM_LANDMARKS_INDEX_OF_MIDDLE_FINGER_BASE = 2
@@ -115,20 +115,25 @@ def infer(self, image, palm):
         return results # [bbox_coords, landmarks_coords, conf]

     def _postprocess(self, blob, rotated_palm_bbox, angle, rotation_matrix):
-        landmarks, conf = blob
+        landmarks, conf, handedness, landmarks_word = blob

+        conf = conf[0][0]
         if conf < self.conf_threshold:
             return None

-        landmarks = landmarks.reshape(-1, 3) # shape: (1, 63) -> (21, 3)
+        landmarks = landmarks[0].reshape(-1, 3) # shape: (1, 63) -> (21, 3)
+        landmarks_word = landmarks_word[0].reshape(-1, 3) # shape: (1, 63) -> (21, 3)

         # transform coords back to the input coords
         wh_rotated_palm_bbox = rotated_palm_bbox[1] - rotated_palm_bbox[0]
         scale_factor = wh_rotated_palm_bbox / self.input_size
         landmarks[:, :2] = (landmarks[:, :2] - self.input_size / 2) * scale_factor
+        landmarks[:, 2] = landmarks[:, 2] * max(scale_factor) # depth scaling
         coords_rotation_matrix = cv.getRotationMatrix2D((0, 0), angle, 1.0)
         rotated_landmarks = np.dot(landmarks[:, :2], coords_rotation_matrix[:, :2])
         rotated_landmarks = np.c_[rotated_landmarks, landmarks[:, 2]]
+        rotated_landmarks_world = np.dot(landmarks_word[:, :2], coords_rotation_matrix[:, :2])
+        rotated_landmarks_world = np.c_[rotated_landmarks_world, landmarks_word[:, 2]]
         # invert rotation
         rotation_component = np.array([
             [rotation_matrix[0][0], rotation_matrix[1][0]],
@@ -144,12 +149,12 @@ def _postprocess(self, blob, rotated_palm_bbox, angle, rotation_matrix):
         original_center = np.array([
             np.dot(center, inverse_rotation_matrix[0]),
             np.dot(center, inverse_rotation_matrix[1])])
-        landmarks = rotated_landmarks[:, :2] + original_center
+        landmarks[:, :2] = rotated_landmarks[:, :2] + original_center

         # get bounding box from rotated_landmarks
         bbox = np.array([
-            np.amin(landmarks, axis=0),
-            np.amax(landmarks, axis=0)]) # [top-left, bottom-right]
+            np.amin(landmarks[:, :2], axis=0),
+            np.amax(landmarks[:, :2], axis=0)]) # [top-left, bottom-right]
         # shift bounding box
         wh_bbox = bbox[1] - bbox[0]
         shift_vector = self.HAND_BOX_SHIFT_VECTOR * wh_bbox
@@ -162,4 +167,9 @@ def _postprocess(self, blob, rotated_palm_bbox, angle, rotation_matrix):
             center_bbox - new_half_size,
             center_bbox + new_half_size])

-        return np.r_[bbox.reshape(-1), landmarks.reshape(-1), conf[0]]
+        # [0: 4]: hand bounding box found in image of format [x1, y1, x2, y2] (top-left and bottom-right points)
+        # [4: 67]: screen landmarks with format [x1, y1, z1, x2, y2 ... x21, y21, z21], z value is relative to WRIST
+        # [67: 130]: world landmarks with format [x1, y1, z1, x2, y2 ... x21, y21, z21], 3D metric x, y, z coordinate
+        # [130]: handedness, (left)[0, 1](right) hand
+        # [131]: confidence
+        return np.r_[bbox.reshape(-1), landmarks.reshape(-1), rotated_landmarks_world.reshape(-1), handedness[0][0], conf]
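The comments above define the 132-element vector returned per hand. A small sketch of how a caller could unpack it is shown below; the field layout is taken from those comments and the handedness threshold from the demo, while the helper function itself is hypothetical.

```python
# Minimal sketch: unpack the 132-element result vector produced by MPHandPose.
# Field offsets follow the comments in _postprocess above; the helper is illustrative only.
import numpy as np

def unpack_handpose(result: np.ndarray) -> dict:
    assert result.shape == (132,)
    return {
        'bbox': result[0:4].reshape(2, 2),                 # [[x1, y1], [x2, y2]]
        'landmarks_screen': result[4:67].reshape(21, 3),   # x, y in pixels; z relative to WRIST
        'landmarks_world': result[67:130].reshape(21, 3),  # metric 3D coordinates
        'handedness': 'Left' if result[130] <= 0.5 else 'Right',
        'conf': float(result[131]),
    }
```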

models/palm_detection_mediapipe/README.md

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,9 @@ This model detects palm bounding boxes and palm landmarks, and is converted from
 - SSD Anchors are generated from [GenMediaPipePalmDectionSSDAnchors](https://github.com/VimalMollyn/GenMediaPipePalmDectionSSDAnchors)


+**Note**:
+- Visit https://google.github.io/mediapipe/solutions/models.html#hands for models of larger scale.
+
 ## Demo

 Run the following commands to try the demo:

tools/quantize/README.md

Lines changed: 1 addition & 1 deletion
@@ -54,4 +54,4 @@ python quantize-inc.py model1

 ## Dataset
 Some models are quantized with extra datasets.
-- [MP-PalmDet](../../models/palm_detection_mediapipe) int8 model quantized with evaluation set of [FreiHAND](https://lmb.informatik.uni-freiburg.de/resources/datasets/FreihandDataset.en.html). The dataset downloaded from [link](https://lmb.informatik.uni-freiburg.de/data/freihand/FreiHAND_pub_v2_eval.zip). Unpack it and path to `FreiHAND_pub_v2_eval/evaluation/rgb`.
+- [MP-PalmDet](../../models/palm_detection_mediapipe) and [MP-HandPose](../../models/handpose_estimation_mediapipe) are quantized with the evaluation set of [FreiHAND](https://lmb.informatik.uni-freiburg.de/resources/datasets/FreihandDataset.en.html). Download the dataset from [this link](https://lmb.informatik.uni-freiburg.de/data/freihand/FreiHAND_pub_v2_eval.zip). Unpack it and replace `path/to/dataset` with the path to `FreiHAND_pub_v2_eval/evaluation/rgb`.
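For reference, here is a small sketch of collecting the FreiHAND evaluation RGB images that serve as calibration data. The directory layout follows the unpacked archive described above; how these images are actually fed into the quantization scripts under tools/quantize is left to those scripts.

```python
# Minimal sketch: gather FreiHAND evaluation RGB images as calibration data.
# Paths follow the unpacked FreiHAND_pub_v2_eval archive referenced in the README above.
import glob
import os

import cv2 as cv

dataset_root = "FreiHAND_pub_v2_eval/evaluation/rgb"
image_paths = sorted(glob.glob(os.path.join(dataset_root, "*.jpg")))
print("found {} calibration images".format(len(image_paths)))

# e.g. read and resize a few samples to the MP-HandPose input size (224x224 after this commit)
samples = [cv.resize(cv.imread(p), (224, 224)) for p in image_paths[:10]]
```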
