Here we propose a novel Space-Time-Separable Graph Convolutional Network (STS-GCN) for pose forecasting. For the first time, STS-GCN models the human pose dynamics only with a graph convolutional network (GCN), including the temporal evolution and the spatial joint interaction within a single-graph framework, which allows the cross-talk of motion and spatial correlations. Concurrently, STS-GCN is the first space-time-separable GCN: the space-time graph connectivity is factored into space and time affinity matrices, which bottlenecks the space-time cross-talk, while enabling full joint-joint and time-time correlations. Both affinity matrices are learnt end-to-end, which results in connections substantially deviating from the standard kinematic tree and the linear-time time series.
In experimental evaluation on three complex, recent and large-scale benchmarks, Human3.6M [Ionescu et al. TPAMI'14], AMASS [Mahmood et al. ICCV'19] and 3DPW [Von Marcard et al. ECCV'18], STS-GCN outperforms the state-of-the-art, surpassing the current best technique [Mao et al. ECCV'20] by over 32% on average in the most difficult long-term predictions, while only requiring 1.7% of its parameters. We explain the results qualitatively and illustrate the graph attention by the factored joint-joint and time-time learnt graph connections.
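To make the factorisation concrete, here is a minimal NumPy sketch; it is not the repository's implementation. It assumes a single shared joint-joint matrix `a_s` and a single shared time-time matrix `a_t` (names, shapes and random values are illustrative assumptions; the actual model may use differently shaped matrices), and checks that applying them separately matches one dense space-time graph whose adjacency is their Kronecker product.

```python
import numpy as np

def separable_graph_mix(x, a_t, a_s):
    """Mix features over time with a_t (T, T) and over joints with a_s (J, J).

    x has shape (T, J, C): frames, joints, channels.
    """
    return np.einsum("tu,jk,ukc->tjc", a_t, a_s, x)

def full_graph_mix(x, a_st):
    """Mix features with one dense adjacency a_st over all T*J space-time nodes."""
    t, j, c = x.shape
    return (a_st @ x.reshape(t * j, c)).reshape(t, j, c)

rng = np.random.default_rng(0)
T, J, C = 4, 3, 2                       # illustrative sizes only
x = rng.normal(size=(T, J, C))
a_t = rng.normal(size=(T, T))           # stands in for a learnt time-time affinity
a_s = rng.normal(size=(J, J))           # stands in for a learnt joint-joint affinity

# The Kronecker product rebuilds the full space-time matrix from the two factors.
a_st = np.kron(a_t, a_s)                # shape (T*J, T*J)
same = np.allclose(separable_graph_mix(x, a_t, a_s), full_graph_mix(x, a_st))

params_separable = a_t.size + a_s.size  # T*T + J*J
params_full = a_st.size                 # (T*J)**2
print(same, params_separable, params_full)
```

In this toy setting the two factors hold 25 entries, against 144 for the dense space-time matrix, which gives a rough sense of how such a factorisation can reduce the parameter count.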
A problem arises because no prior human pose forecasting work has explicitly written out the test MPJPE metric. [Mao et al., 2020; Mao et al., 2019] specify the MPJPE for the learning loss and refer to the same MPJPE for testing, although the metric actually used for testing is different.
In [Mao et al., 2020], Eq. (6), they define MPJPE as
$$MPJPE = \frac{1}{J(M+T)}\sum_{t=1}^{M+T} \sum_{j=1}^J ||\hat{\textbf{p}}_{t,j} - \textbf{p}_{t,j} ||^2,$$
which averages the errors over all M+T frames, observed and predicted, up to the prediction horizon T.
Also in [Ionescu et al., 2014], Eq. (8), they define the MPJPE as:
$$MPJPE(t) = \frac{1}{J} \sum_{j=1}^J ||\hat{\textbf{p}}_{t,j} - \textbf{p}_{t,j} ||^2,$$
and they state: "For a set of frames the error is the average over the MPJPEs of all frames."
We have therefore interpreted the test MPJPE to be:
$$MPJPE = \frac{1}{J T}\sum_{t=M+1}^{M+T} \sum_{j=1}^J ||\hat{\textbf{p}}_{t,j} - \textbf{p}_{t,j} ||^2,$$
which is implemented in our testing code. Note: coding has been done in good faith, and in good faith we have open-sourced the project here.
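The test metric above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the repository's testing code: the function names and the nested-list layout `[frame][joint][coordinate]` are assumptions, and the squared norm simply follows the equations as written here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two joint positions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mpjpe_frame(pred_frame, gt_frame):
    """Per-frame error: mean over the J joints of one frame."""
    total = sum(squared_distance(p, q) for p, q in zip(pred_frame, gt_frame))
    return total / len(gt_frame)

def mpjpe_test(pred, gt, m):
    """Test error: mean of the per-frame errors over the T future frames only.

    pred and gt hold M + T frames; the first m (= M) are the observed frames.
    """
    errors = [mpjpe_frame(p, g) for p, g in zip(pred[m:], gt[m:])]
    return sum(errors) / len(errors)

def mpjpe_all_frames(pred, gt):
    """Loss-style error: mean over all M + T frames, observed and future."""
    errors = [mpjpe_frame(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)

# Toy data: M = 1 observed frame, T = 2 future frames, J = 2 joints in 3D.
gt = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]] * 3
pred = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],   # observed frame, reproduced exactly
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],   # joint 0 off by 1 -> frame error 0.5
    [[0.0, 2.0, 0.0], [1.0, 0.0, 0.0]],   # joint 0 off by 2 -> frame error 2.0
]
test_error = mpjpe_test(pred, gt, m=1)     # (0.5 + 2.0) / 2
loss_error = mpjpe_all_frames(pred, gt)    # (0.0 + 0.5 + 2.0) / 3
```

On this toy input the two values differ because the loss-style average also counts the observed frame, which is reproduced without error.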
As noted in this thread, the code provided by [Mao et al., 2020] actually considers only the target temporal horizon, not the average up to that time.
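The difference between the two conventions is easy to see on a list of per-frame errors. The sketch below uses made-up error values and hypothetical function names, and assumes 40 ms between frames (25 fps), so that 400 ms and 1000 ms correspond to future frames 10 and 25; none of it is taken from either code base.

```python
FRAME_MS = 40  # assumed frame spacing (25 fps); adjust to the dataset in use

def horizon_to_frame(horizon_ms, frame_ms=FRAME_MS):
    """1-based index of the future frame that lies horizon_ms ahead."""
    if horizon_ms % frame_ms != 0:
        raise ValueError("horizon must be a multiple of the frame spacing")
    return horizon_ms // frame_ms

def error_at_horizon(frame_errors, horizon_ms):
    """Error of the single frame at the target horizon."""
    return frame_errors[horizon_to_frame(horizon_ms) - 1]

def average_up_to_horizon(frame_errors, horizon_ms):
    """Mean error over all future frames up to and including the horizon."""
    k = horizon_to_frame(horizon_ms)
    return sum(frame_errors[:k]) / k

# Made-up per-frame errors that grow linearly over T = 25 future frames.
frame_errors = [float(t) for t in range(1, 26)]

for ms in (400, 1000):
    print(ms, error_at_horizon(frame_errors, ms), average_up_to_horizon(frame_errors, ms))
```

Whenever the per-frame error grows with the horizon, as in this toy series, the average up to a horizon is lower than the error at the horizon itself, so numbers computed under the two conventions are not directly comparable.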
Running the test code of [Mao et al., 2020], short-term (400ms) and long-term (1000ms) errors for the Human3.6M dataset for STS-GCN are:
Here we report this performance and specify the test MPJPE error, to avoid future discrepancies.
$ pip install -r requirements.txt
Human3.6M in exponential map format can be downloaded from here.
Directory structure:
H3.6m
|-- S1
|-- S5
|-- S6
|-- ...
`-- S11
AMASS can be downloaded from their official website.
Directory structure:
amass
|-- ACCAD
|-- BioMotionLab_NTroje
|-- CMU
|-- ...
`-- Transitions_mocap
3DPW can be downloaded from their official website.
Directory structure:
3dpw
|-- imageFiles
|   |-- courtyard_arguing_00
|   |-- courtyard_backpack_00
|   |-- ...
`-- sequenceFiles
    |-- test
    |-- train
    `-- validation
Put all the downloaded datasets in the ../datasets directory.
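Before training, it can help to check that the layout matches the trees above. The helper below is a hypothetical sketch, not part of the repository: it assumes the three dataset folders sit directly under `../datasets` with the root names shown in the trees (`H3.6m`, `amass`, `3dpw`), and it only checks the subdirectories that are explicitly listed.

```python
from pathlib import Path

# Subdirectories taken from the trees above; the lists are deliberately partial.
EXPECTED = {
    "H3.6m": ["S1", "S5", "S6", "S11"],
    "amass": ["ACCAD", "BioMotionLab_NTroje", "CMU", "Transitions_mocap"],
    "3dpw": ["imageFiles", "sequenceFiles/test", "sequenceFiles/train",
             "sequenceFiles/validation"],
}

def missing_dataset_dirs(root="../datasets", expected=EXPECTED):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [
        str(Path(name) / sub)
        for name, subs in expected.items()
        for sub in subs
        if not (root / name / sub).is_dir()
    ]

if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing directories:")
        for path in missing:
            print("  " + path)
    else:
        print("Dataset layout looks complete.")
```

Run it from the directory that contains the training scripts, so that the relative path `../datasets` resolves as intended.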
The arguments for running the code are defined in parser.py. We have used the following commands for training the network on different datasets and body pose representations (3D and Euler angles):
python main_h36_3d.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 22
python main_h36_ang.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 16
python main_amass_3d.py --input_n 10 --output_n 25 --skip_rate 5 --joints_to_consider 18
To test on the pretrained model, we have used the following commands:
python main_h36_3d.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 22 --mode test --model_path ./checkpoints/CKPT_3D_H36M
python main_h36_ang.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 16 --mode test --model_path ./checkpoints/CKPT_ANG_H36M
python main_amass_3d.py --input_n 10 --output_n 25 --skip_rate 5 --joints_to_consider 18 --mode test --model_path ./checkpoints/CKPT_3D_AMASS
For visualizing from a pretrained model, we have used the following commands:
python main_h36_3d.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 22 --mode viz --model_path ./checkpoints/CKPT_3D_H36M --n_viz 5
python main_h36_ang.py --input_n 10 --output_n 25 --skip_rate 1 --joints_to_consider 16 --mode viz --model_path ./checkpoints/CKPT_ANG_H36M --n_viz 5
python main_amass_3d.py --input_n 10 --output_n 25 --skip_rate 5 --joints_to_consider 18 --mode viz --model_path ./checkpoints/CKPT_3D_AMASS --n_viz 5
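The commands above differ only in the script, two dataset-specific values and the mode flags, so they can be generated from a small table. This is a convenience sketch and not part of the repository: `build_command` and `CONFIGS` are hypothetical names, while the scripts, flags, values and checkpoint paths are copied from the commands above.

```python
import shlex

# (script, skip_rate, joints_to_consider, checkpoint), copied from the commands above.
CONFIGS = {
    "h36_3d": ("main_h36_3d.py", 1, 22, "./checkpoints/CKPT_3D_H36M"),
    "h36_ang": ("main_h36_ang.py", 1, 16, "./checkpoints/CKPT_ANG_H36M"),
    "amass_3d": ("main_amass_3d.py", 5, 18, "./checkpoints/CKPT_3D_AMASS"),
}

def build_command(config, mode=None, n_viz=5):
    """Return the argument list for training (mode=None), "test" or "viz"."""
    script, skip_rate, joints, checkpoint = CONFIGS[config]
    cmd = ["python", script, "--input_n", "10", "--output_n", "25",
           "--skip_rate", str(skip_rate), "--joints_to_consider", str(joints)]
    if mode in ("test", "viz"):
        cmd += ["--mode", mode, "--model_path", checkpoint]
        if mode == "viz":
            cmd += ["--n_viz", str(n_viz)]
    elif mode is not None:
        raise ValueError(f"unknown mode: {mode!r}")
    return cmd

# Print the three visualization commands as shell-ready strings.
for name in CONFIGS:
    print(shlex.join(build_command(name, mode="viz")))
```

The list returned by `build_command` can also be passed directly to `subprocess.run` when scripting several runs.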
If you use our code, please cite our work:
@misc{sofianos2021spacetimeseparable,
title={Space-Time-Separable Graph Convolutional Network for Pose Forecasting},
author={Theodoros Sofianos and Alessio Sampieri and Luca Franco and Fabio Galasso},
year={2021},
eprint={2110.04573},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Some of our code was adapted from HisRepItself by Wei Mao.
The authors wish to acknowledge Panasonic for partially supporting this work and the project of the Italian Ministry of Education, Universities and Research (MIUR) “Dipartimenti di Eccellenza 2018-2022”.
MIT license