This repository tracks the research papers that address the problem of estimating the pose of multiple objects in 3D from a single RGB(D) image and evaluate on the NOCS-REAL275 dataset. The repository provides a list of research papers that followed up the work presented in the paper Normalized object coordinate space for category-level 6D object pose and size estimation, a brief summary and review of the NOCS datasets, and a comparison table with results from these works on the NOCS dataset.
Have you submitted a new manuscript, or has your manuscript been accepted to a conference or journal, and was your method evaluated on the NOCS datasets? Share the manuscript, webpage link, and results, and I will include your work in this repository.
- List of papers
- The NOCS CAMERA and REAL275 datasets
- Results on NOCS-REAL275
- Additional references
- Enquiries, Questions, and Comments
- Licence
The following list is not exhaustive. Note that the links distinguish between a published paper ([paper]) and an arXiv submission ([arxiv]).
[1] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, Normalized object coordinate space for category-level 6D object pose and size estimation, CVPR, 2019
[paper][webpage][code]
[2] D. Chen, J. Li, Z. Wang, K. Xu, Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation, CVPR, 2020
[paper][webpage][code]
[3] X. Chen, Z. Dong, J. Song, A. Geiger, O. Hilliges, Category Level Object Pose Estimation via Neural Analysis-by-Synthesis, ECCV 2020
[arxiv][webpage][code]
[4] M. Tian, M. H. Ang Jr, G. H. Lee, Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation, ECCV 2020
[paper][code]
[5] C. Wang, R. Martín-Martín, D. Xu, J. Lv, C. Lu, L. Fei-Fei, S. Savarese, Y. Zhu, 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints, ICRA 2020
[paper][webpage][code]
[6] T. Lee, B. -U. Lee, M. Kim and I. S. Kweon, Category-Level Metric Scale Object Shape and Pose Estimation, in IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8575-8582, Oct. 2021
[paper]
[7] W. Chen, X. Jia, H. J. Chang, J. Duan, L. Shen, A. Leonardis, FS-Net: Fast Shape-based Network for Category-Level 6D Object Pose Estimation with Decoupled Rotation Mechanism, CVPR 2021
[paper][code]
[8] J. Lin, Z. Wei, Z. Li, S. Xu, K. Jia, Y. Li, DualPoseNet: Category-Level 6D Object Pose and Size Estimation Using Dual Pose Network With Refined Learning of Pose Consistency, ICCV 2021
[paper][code]
[9] M. Z. Irshad, T. Kollar, M. Laskey, K. Stone, Z. Kira, CenterSnap: Single-shot multi-object 3D shape reconstruction and categorical 6D pose and size estimation, ICRA, 2022
[arxiv][webpage][code]
[10] M. Z. Irshad, S. Zakharov, R. Ambrus, T. Kollar, Z. Kira, A. Gaidon, ShAPO: Implicit representations for multi-object shape, appearance, and pose optimization, ECCV, 2022
[arxiv][webpage]
The NOCS CAMERA and REAL275 datasets were provided along with the work done by Wang et al. [1]. All the data can be downloaded here.
- Code and data are not provided with an open license (see Issue#57) and hence both code and data should be assumed under copyright. Because of this, code and data can only be forked, downloaded, and/or viewed, but not used for any other purpose. Authors declared that the data is only for non-commercial use; however, no license is provided when downloading the data that confirms this.
- The software builds on top of Mask R-CNN, which was released under the MIT license. The NOCS repository therefore violates the granted permissions, because the original copyright notice and permission notice are not included in all copies or substantial portions of the software.
The results reported in the following comparison table are taken from the corresponding papers.
The accuracy of the detection of the 3D bounding boxes is measured using the Jaccard Index (or Intersection over Union, IoU) at different thresholds (i.e., 25%, 50%, and 75%). The accuracy of the object pose is measured by thresholding the translation and rotation errors, that is, by counting the number of objects whose pose error is within the thresholds over the total number of objects. For the rotation error, the thresholds are 5° and 10°. For the translation error, the thresholds are 2 cm, 5 cm, and 10 cm. A rotation threshold and a translation threshold are combined for each evaluation (e.g., 5° and 2 cm). Speed is measured in frames per second (results were analysed and reported by FS-Net).
The motivation and applications behind the choices of these thresholds are not reported, and some choices seem too permissive if we consider grasping as a possible application (e.g., 5 or 10 cm, and 25% for IoU).
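The metrics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used by any of the listed papers: the function names are hypothetical, boxes are simplified to axis-aligned ones, translations are assumed to be in metres, thresholds are treated as inclusive, and object symmetries and the matching of detections to ground truth are not handled.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres (assumption: inputs are in metres)."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(100.0 * np.linalg.norm(diff))


def iou_3d_axis_aligned(min_a, max_a, min_b, max_b):
    """Jaccard index of two axis-aligned 3D boxes given by min/max corners.

    Simplification: benchmark boxes are generally oriented, not axis-aligned.
    """
    min_a, max_a = np.asarray(min_a, dtype=float), np.asarray(max_a, dtype=float)
    min_b, max_b = np.asarray(min_b, dtype=float), np.asarray(max_b, dtype=float)
    # Per-axis overlap length, zero where the boxes do not intersect.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    union = (max_a - min_a).prod() + (max_b - min_b).prod() - inter
    return float(inter / union) if union > 0 else 0.0


def pose_accuracy(errors, max_deg, max_cm):
    """Percentage of objects whose (rotation, translation) error is within both thresholds.

    `errors` is a sequence of (rotation_error_deg, translation_error_cm) pairs.
    """
    errors = np.asarray(errors, dtype=float).reshape(-1, 2)
    if len(errors) == 0:
        return 0.0
    within = (errors[:, 0] <= max_deg) & (errors[:, 1] <= max_cm)
    return float(100.0 * within.mean())


# Toy example with made-up errors for four objects.
errors = [(3.0, 1.0), (7.0, 4.0), (12.0, 1.0), (4.0, 6.0)]
print(pose_accuracy(errors, max_deg=5, max_cm=5))   # 25.0
print(pose_accuracy(errors, max_deg=10, max_cm=5))  # 50.0
```

In the toy example, loosening the rotation threshold from 5° to 10° doubles the reported accuracy, which illustrates why the choice of thresholds matters when comparing methods.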
Best-performing results for each column are highlighted in bold.
Reference | IoU-25 | IoU-50 | IoU-75 | 5°,2cm | 5°,5cm | 5°,10cm | 10°,2cm | 10°,5cm | 10°,10cm | Speed |
---|---|---|---|---|---|---|---|---|---|---|
[1] NOCS (32 bins)* | 84.4 | 79.3 | -- | -- | 16.1 | -- | -- | 43.7 | 43.1 | 5 |
[1] NOCS (32 bins) | 84.8 | 78.0 | 30.1 | 7.2 | 10.0 | 9.8 | 13.8 | 25.2 | 25.8 | 5 |
[1] NOCS (128 bins) | 84.9 | 80.5 | 30.1 | -- | 9.5 | -- | -- | 26.7 | 26.7 | 5 |
[2] CASS | 84.2 | 77.7 | -- | -- | 23.5 | 23.8 | -- | 58.0 | 58.3 | -- |
[3] Neural-object-fitting | -- | -- | -- | -- | 0.9 | 1.4 | -- | 2.4 | 5.5 | -- |
[4] Deformnet (RGB) | -- | 75.2 | 46.5 | 15.7 | 18.8 | -- | 33.7 | 47.4 | -- | -- |
[4] Deformnet (RGB-D) | 83.4 | 77.3 | 53.2 | 19.3 | 21.4 | 21.4 | 43.2 | 54.1 | -- | 4 |
[5] 6-PACK | 94.2 | -- | -- | -- | 33.3 | -- | -- | -- | -- | 10 |
[6] MSOS (RGB)** | 62.0 | 23.4 | 3.0 | -- | -- | -- | -- | -- | 9.6 | -- |
[6] MSOS (RGB-D)** | 81.6 | 68.1 | 32.9 | -- | 5.3 | 5.5 | -- | 24.7 | 26.5 | -- |
[7] FS-Net | **95.1** | **92.2** | **63.5** | -- | 28.2 | -- | -- | 60.8 | 64.6 | **20** |
[8] DualPose | -- | 79.8 | 62.2 | **29.3** | 35.9 | -- | **50.0** | **66.8** | -- | -- |
[9] CenterSnap (RGB-D) | 83.5 | 80.2 | -- | -- | 27.2 | 29.2 | -- | 58.8 | 64.4 | -- |
[9] CenterSnap-R (RGB-D) | 83.5 | 80.2 | -- | -- | 29.1 | 31.6 | -- | 64.3 | 70.9 | -- |
[10] ShAPO | 85.3 | 79.0 | -- | -- | **48.8** | **57.0** | -- | **66.8** | **78.0** | -- |
*Values as wrongly reported by CASS [2]
**Different reporting of the results of the methods under comparison
Note: Authors tend to choose different thresholds when they compare their methods with others, and the choice often falls on the more permissive thresholds, as seen from the many empty cells in the table. Some research papers also report different numbers than the original papers, and the reason for this is not always explained. Ideally, authors would motivate the choice of thresholds and evaluate on more restrictive thresholds depending on the application (e.g., grasping).
- E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, C. Rother, Learning 6D Object Pose Estimation Using 3D Object Coordinates, ECCV 2014
[paper]
Note: this is the first paper I am aware of that defines the representation of an object as 3D object coordinates and then estimates the object pose from this representation. NOCS seems to be inspired by this work, but simply applied to Mask R-CNN as an additional branch.
- Y. Lin, J. Tremblay, S. Tyree, P. A. Vela, S. Birchfield, Single-stage Keypoint-based Category-level Object Pose Estimation from an RGB Image, ICRA 2022
[paper]
- A. Grabner, P. M. Roth, V. Lepetit, 3D Pose Estimation and 3D Model Retrieval for Objects in the Wild, CVPR, 2018
[paper][webpage]
If you have any further enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the GitHub issue tracker.
This repository is licensed under the MIT License. To view a copy of this license, see LICENSE.