This repository contains training code for the paper Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose. This work heavily optimizes the OpenPose approach to reach real-time inference on CPU with a negligible accuracy drop. It detects a skeleton (which consists of keypoints and connections between them) to identify human poses for every person in the image. The pose may contain up to 18 keypoints: ears, eyes, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles. This code achieves 40% AP for single-scale inference on the COCO 2017 Keypoint Detection validation set, without flip or any post-processing. You can reproduce the result using this repository.
The implementation of our paper on the single-person pose estimation problem, Global Context for Convolutional Pose Machines, has been added. See README_single.md for details.
- Requirements
- Prerequisites
- Training
- Validation
- Pretrained Model
- Output Format
- C++ Demo
- Python Demo
- Fine-Tuning
- Citation
- Ubuntu* 16.04
- Python* 3.6
- PyTorch* 0.4.1 (PyTorch* 1.0 not validated)
- Download the COCO 2017 dataset: http://cocodataset.org/#download (train, val, annotations) and unpack it to the <COCO_HOME> folder.
- Create a virtual environment:
  bash init_venv.sh
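
Before training, you can quickly verify that the dataset was unpacked as expected. The sketch below only checks the standard COCO 2017 folder and annotation-file names used by the commands in this README; the <COCO_HOME> value is a placeholder you need to fill in.

```python
# Minimal sanity check of the COCO 2017 layout (a sketch; set COCO_HOME to your path).
from pathlib import Path

COCO_HOME = Path('/path/to/COCO_HOME')  # placeholder for the unpacked dataset location

expected = [
    COCO_HOME / 'train2017',
    COCO_HOME / 'val2017',
    COCO_HOME / 'annotations' / 'person_keypoints_train2017.json',
    COCO_HOME / 'annotations' / 'person_keypoints_val2017.json',
]
for path in expected:
    status = 'ok' if path.exists() else 'MISSING'
    print(f'{status:7s} {path}')
```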
Training consists of three steps (the given AP values are for the full validation dataset):
- Training from MobileNet weights. Expected AP after this step is ~38%.
- Training from the weights obtained in the previous step. Expected AP after this step is ~39%.
- Training from the weights obtained in the previous step, with the number of refinement stages in the network increased to 3. Expected AP after this step is ~40% (for the network with a single refinement stage; the last two stages are discarded).
- Download the pretrained MobileNet v1 weights mobilenet_sgd_68.848.pth.tar from the pytorch-mobilenet repository (choose the sgd option).
- Convert the train annotations to the internal format by running the following:
  python scripts/prepare_train_labels.py --labels <COCO_HOME>/annotations/person_keypoints_train2017.json
  This produces the prepared_train_annotation.pkl file with annotations converted to the internal format (a quick way to inspect it is sketched after this list).
- Optional. For fast validation, make a subset of the validation dataset by running the following:
  python scripts/make_val_subset.py --labels <COCO_HOME>/annotations/person_keypoints_val2017.json
  This produces the val_subset.json file with annotations for just 250 random images (out of 5000).
- To train from the MobileNet weights, run the following:
  python train.py \
    --train-images-folder <COCO_HOME>/train2017/ \
    --prepared-train-labels prepared_train_annotation.pkl \
    --val-labels val_subset.json \
    --val-images-folder <COCO_HOME>/val2017/ \
    --checkpoint-path <path_to>/mobilenet_sgd_68.848.pth.tar \
    --from-mobilenet
- To train from the checkpoint from the previous step, run the command below:
  python train.py \
    --train-images-folder <COCO_HOME>/train2017/ \
    --prepared-train-labels prepared_train_annotation.pkl \
    --val-labels val_subset.json \
    --val-images-folder <COCO_HOME>/val2017/ \
    --checkpoint-path <path_to>/checkpoint_iter_420000.pth \
    --weights-only
- To train from the checkpoint from the previous step with three refinement stages in the network, run the following:
  python train.py \
    --train-images-folder <COCO_HOME>/train2017/ \
    --prepared-train-labels prepared_train_annotation.pkl \
    --val-labels val_subset.json \
    --val-images-folder <COCO_HOME>/val2017/ \
    --checkpoint-path <path_to>/checkpoint_iter_280000.pth \
    --weights-only \
    --num-refinement-stages 3
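
The prepared_train_annotation.pkl file from the annotation-conversion step above can be inspected with a minimal sketch like the one below. It only assumes the file is a regular Python pickle; the exact schema is defined by scripts/prepare_train_labels.py, so the sketch just prints what it finds.

```python
# Inspect the converted training annotations (a sketch; no particular internal
# schema is assumed beyond generic pickle loading).
import pickle

with open('prepared_train_annotation.pkl', 'rb') as f:
    annotations = pickle.load(f)

print(type(annotations), len(annotations) if hasattr(annotations, '__len__') else 'n/a')
# If it is a list of per-image records, show the fields of the first record.
if isinstance(annotations, list) and annotations and isinstance(annotations[0], dict):
    print(sorted(annotations[0].keys()))
```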
We took the checkpoint after 370000 iterations as the final one.
We did not perform best checkpoint selection at any step, so a similar result may be achieved after fewer iterations.
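
To check what a saved training checkpoint contains (for example, to confirm which iteration it corresponds to), a minimal sketch is below. It only assumes the checkpoint is a regular torch.save dictionary; the exact key names are whatever train.py saved, so treat the printout as the source of truth.

```python
# List the top-level contents of a training checkpoint (a sketch; specific key
# names are not assumed, they are simply printed).
import torch

checkpoint = torch.load('checkpoint_iter_370000.pth', map_location='cpu')
if isinstance(checkpoint, dict):
    for key, value in checkpoint.items():
        size = len(value) if hasattr(value, '__len__') else value
        print(key, type(value).__name__, size)
else:
    print(type(checkpoint).__name__)
```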
We observe this error when the maximum number of open files (ulimit -n) equals 1024:
File "train.py", line 164, in <module>
args.log_after, args.val_labels, args.val_images_folder, args.val_output_name, args.checkpoint_after, args.val_after)
File "train.py", line 77, in train
for _, batch_data in enumerate(train_loader):
File "/<path>/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
idx, batch = self._get_batch()
File "/<path>/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
return self.data_queue.get()
File "/<path>/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/<path>/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/<path>/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/<path>/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/<path>/python3.6/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
To get rid of it, increase the limit to a bigger number. For example, to increase the limit to 65536, run the following in the terminal: ulimit -n 65536.
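
If you prefer not to change the shell limit, the soft limit can also be raised from inside Python before the DataLoader workers start. This is only a sketch using the standard resource module (Linux/macOS); it is not something train.py does by itself.

```python
# Raise the soft limit on open files from within Python (Linux/macOS only).
# This mirrors `ulimit -n`, but can never exceed the hard limit set by the system.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
print('open-files limit:', resource.getrlimit(resource.RLIMIT_NOFILE))
```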
Run the following:
python val.py \
--labels <COCO_HOME>/annotations/person_keypoints_val2017.json \
--images-folder <COCO_HOME>/val2017 \
--checkpoint-path <CHECKPOINT>
The model expects a normalized image (mean=[128, 128, 128], scale=[1/256, 1/256, 1/256]) in the planar BGR format. A model pretrained on COCO is available at the Intel® Open Source Technology Center; it has 40% AP on the COCO validation set (38.6% AP on the val subset).
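
For reference, the input normalization described above can be reproduced in a few lines of NumPy/OpenCV. This is only a sketch of the expected layout (planar BGR, mean 128, scale 1/256); the file name and resize target below are illustrative choices, not values mandated by the model.

```python
# Preprocess an image into the format the model expects:
# BGR channel order, (pixel - 128) / 256, then HWC -> CHW (planar) plus a batch dimension.
import cv2
import numpy as np

img = cv2.imread('example.jpg')           # OpenCV loads images in BGR order
img = cv2.resize(img, (456, 256))         # illustrative size only
img = (img.astype(np.float32) - 128.0) / 256.0
inp = np.transpose(img, (2, 0, 1))[None]  # shape (1, 3, H, W), planar BGR
print(inp.shape, inp.dtype)
```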
- Convert the PyTorch* model to the ONNX* format by running the script in the terminal:
  python scripts/convert_to_onnx.py --checkpoint-path <CHECKPOINT>
  The script produces human-pose-estimation.onnx (a quick sanity check of the exported model is sketched after these steps).
- Convert the ONNX model to the OpenVINO™ format with Model Optimizer by running the script below in the terminal:
  python <OpenVINO_INSTALL_DIR>/deployment_tools/model_optimizer/mo.py \
    --input_model human-pose-estimation.onnx \
    --input data \
    --mean_values data[128.0,128.0,128.0] \
    --scale_values data[256] \
    --output stage_1_output_0_pafs,stage_1_output_1_heatmaps
  This produces the human-pose-estimation.xml model and the human-pose-estimation.bin weights in single-precision floating-point format (FP32).
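
A quick sanity check of the exported human-pose-estimation.onnx (from the first step above) is to run it once with onnxruntime. This is only a sketch: onnxruntime is not a dependency of this repository, the input name data comes from the Model Optimizer command above, and the dummy input shape is an assumption chosen for illustration.

```python
# Run the exported ONNX model on a dummy input and report the output shapes.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('human-pose-estimation.onnx')
dummy = np.zeros((1, 3, 256, 456), dtype=np.float32)  # assumed (batch, channels, height, width)
outputs = session.run(None, {'data': dummy})          # input name 'data', as in the mo.py command
for meta, out in zip(session.get_outputs(), outputs):
    print(meta.name, out.shape)
```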
The network has two outputs:
- Keypoints heatmaps
- Part affinity fields (used for grouping keypoints into person instances)
These outputs are post-processed, and the group_keypoints function returns the recognized poses, represented as two lists:
- all_keypoints - a list of keypoints for all poses found in the image. Each element is a keypoint with its confidence and global id, represented as a tuple of 4 elements: elements 0 and 1 are the keypoint coordinates, element 2 is the keypoint confidence (the value from the heatmap), and the last one is the global keypoint id in this list.
- pose_entries - a list of poses. Each pose is a list of 20 elements: elements [0, 17] are global keypoint ids, element 18 is the pose confidence (the sum of keypoint confidences and of the confidences of connections between keypoints), and the last one is the number of recognized keypoints for the pose.
Keypoint order and names are defined in the Pose class. An example of parsing is shown in the demo.
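
As a companion to the description above, here is a minimal sketch of turning the two lists into per-person keypoint arrays. It relies only on the layouts documented above; the convention that a missing keypoint is marked with -1 inside a pose entry is an assumption of this sketch, and the demo code remains the authoritative reference.

```python
# Group the post-processed output into per-person keypoints (a sketch based on the
# layouts described above: all_keypoints entries are (x, y, confidence, global_id),
# pose_entries are 20-element lists with 18 global ids, a pose score and a keypoint count).
import numpy as np

def poses_from_entries(all_keypoints, pose_entries, num_keypoints=18):
    poses = []
    for entry in pose_entries:
        keypoints = np.full((num_keypoints, 3), -1, dtype=np.float32)  # x, y, confidence
        for kpt_id in range(num_keypoints):
            global_id = int(entry[kpt_id])
            if global_id != -1:  # assumed marker for a keypoint not found in this pose
                keypoints[kpt_id] = all_keypoints[global_id][:3]
        pose_score = entry[18]
        found_keypoints = entry[19]
        poses.append((keypoints, pose_score, found_keypoints))
    return poses
```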
To run the demo, download the Intel® OpenVINO™ Toolkit https://software.intel.com/en-us/openvino-toolkit/choose-download, install it, and build the samples (see the Inferring Your Model with the Inference Engine Samples part). Then run:
<SAMPLES_BIN_FOLDER>/human_pose_estimation_demo -m <path_to>/human-pose-estimation.xml -i <path_to_video_file>
for inference on CPU.
We provide the Python demo just for a quick preview of results. Consider the C++ demo for the best performance. To run the Python demo from a webcam:
python demo.py --checkpoint-path <path_to>/checkpoint_iter_370000.pth --video 0
- The annotations have to be in the COCO format.
- If the number of keypoints differs from this model's (18), then adjustments to the network structure and to the result parsing code (e.g. modules/keypoints.py) are required.
- The output order of keypoints is consistent with the OpenPose format (from 1 to 18).
- Fine-tuning steps are the same as steps 2-3 of training; just pass the path to the pretrained checkpoint and your data.
We will perform fine-tuning on the first half of the validation dataset and report the performance on the second half (see image indices in data/val2017_2nd_half_indices.txt for reference) after 200 iterations (~6 epochs).
- Validate before fine-tuning:
  python val.py \
    --labels val2017_2nd_part.json \
    --output-name detections.json \
    --images-folder <COCO_HOME>/val2017 \
    --checkpoint-path <path_to>/checkpoint_iter_370000.pth
  AP before fine-tuning is 39%.
- Perform fine-tuning for 200 iterations:
  python train.py \
    --train-images-folder <COCO_HOME>/val2017/ \
    --prepared-train-labels prepared_val2017_1st_part_annotation.pkl \
    --val-labels val2017_2nd_part.json \
    --val-images-folder <COCO_HOME>/val2017/ \
    --checkpoint-path <path_to>/checkpoint_iter_370000.pth \
    --weights-only \
    --checkpoint-after 200 \
    --val-after 100 \
    --log-after 12 \
    --base-lr 0.00000444444
  Expected AP after 200 iterations of fine-tuning is 39.1%-39.2%.
If this helps your research, please cite the paper:
@inproceedings{osokin2018lightweight_openpose,
author={Osokin, Daniil},
title={Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose},
booktitle = {arXiv preprint arXiv:1811.12004},
year = {2018}
}