HOnnotate: A method for 3D Annotation of Hand and Object Poses
In our recent work, we developed a method for automatically annotating the 3D hand and object poses of a sequence captured with a single-camera or multi-camera setup. Using this method, we created the HO3D dataset, which contains 3D annotations for more than 65 sequences captured with 10 different subjects and 10 objects. The dataset contains approximately 80K images, roughly divided into 70K training images and 10K evaluation images. Through this challenge, we allow researchers to access our dataset and evaluate their hand pose estimation methods with different metrics. Note that this challenge only evaluates the accuracy of the estimated hand poses, not the object poses. The ground truth hand pose annotations for the evaluation split are withheld, while the object pose annotations for both the training and evaluation splits are made public. For more details, visit our project page and the GitHub repo.
HO3D Challenge Benchmark:
||Mesh error (cm) (Procrustes alignment)
||Joint error (cm)
 Hampali et al. HOnnotate: A method for 3D Annotation of Hand and Object Poses. CVPR'20
 Hasson et al. Learning Joint Reconstruction of Hands and Manipulated Objects. CVPR'19
The methods are evaluated based on the accuracy of the estimated 3D joint locations and hand mesh vertices. From a single RGB image, the pose of the hand can only be estimated up to a scale. Hence, to make the metrics independent of scale, we first align the predicted hand pose with the ground truth hand pose using two different alignment methods, following prior work (see the references listed below). The metrics are then calculated on the hand mesh and the 3D joint locations separately.
- Procrustes alignment: This method is used to align the estimated 3D joint locations and hand mesh vertices with the ground truth. Note that Procrustes alignment changes the global rotation, translation and scale of the estimated pose to match the ground truth pose.
- Root joint and global scale alignment: This method is used to align only the estimated 3D joint locations with the ground truth. Refer to the references listed below for more details. This type of alignment only changes the global scale and translation of the estimated poses to match the ground truth poses.
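The two alignments described above can be sketched in NumPy as follows. This is a minimal illustration and not the official evaluation script: the function names are hypothetical, the Procrustes variant shown here excludes reflections, and the choice of a single reference bone (`root_idx` to `ref_idx`) for estimating the global scale is an assumption made for the example.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with a rotation, translation and scale."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt
    # SVD of the cross-covariance between the centred point sets.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so that the result is a proper rotation.
    d = np.ones(3)
    if np.linalg.det(vt.T @ u.T) < 0:
        d[-1] = -1.0
    rot = vt.T @ np.diag(d) @ u.T
    # Least-squares optimal scale for the chosen rotation.
    scale = (s * d).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_gt


def root_scale_align(pred, gt, root_idx=0, ref_idx=1):
    """Match global scale and root translation only; rotation is untouched.

    Assumption: the scale is taken from the length of one reference bone.
    """
    pred_len = np.linalg.norm(pred[ref_idx] - pred[root_idx])
    gt_len = np.linalg.norm(gt[ref_idx] - gt[root_idx])
    scale = gt_len / pred_len if pred_len > 0 else 1.0
    scaled = pred * scale
    # Translate so that the root joint coincides with the ground truth root.
    return scaled + (gt[root_idx] - scaled[root_idx])
```

Because the second alignment leaves the global rotation unchanged, a prediction with a wrong global orientation is still penalised under it, whereas Procrustes alignment removes that error.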
- Mesh error: Average Euclidean distance between predicted and ground truth meshes.
- F-score: The harmonic mean of recall and precision between two meshes, given a distance threshold.
- Mean joint error: Average Euclidean distance between predicted and ground truth 3D joint locations.
- AUC: Area under the percentage of correct keypoints (PCK) curve over an interval from 0 cm to 5 cm with 100 equally spaced thresholds.
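As a rough illustration of the metrics listed above, the sketch below computes them with NumPy for inputs given in centimetres. The function names are hypothetical, the F-score uses brute-force nearest-neighbour distances between vertex sets, and details such as using `<=` at the threshold are assumptions; the official evaluation script may differ.

```python
import numpy as np


def mean_euclidean_error(pred, gt):
    """Mean per-point distance; usable for joints (21, 3) or vertices (778, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def pck_auc(errors_cm, max_cm=5.0, num_thresholds=100):
    """Normalised area under the PCK curve on [0, max_cm]."""
    thresholds = np.linspace(0.0, max_cm, num_thresholds)
    errors = np.asarray(errors_cm, dtype=float).ravel()
    # PCK at a threshold: fraction of keypoints with error within it.
    pck = np.array([(errors <= t).mean() for t in thresholds])
    # Trapezoidal integration, divided by the interval length.
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_cm)


def f_score(pred_verts, gt_verts, threshold_cm):
    """Harmonic mean of precision and recall at a distance threshold."""
    dists = np.linalg.norm(
        pred_verts[:, None, :] - gt_verts[None, :, :], axis=2
    )
    # Precision: predicted points that lie close to some ground truth point.
    precision = (dists.min(axis=1) <= threshold_cm).mean()
    # Recall: ground truth points that lie close to some predicted point.
    recall = (dists.min(axis=0) <= threshold_cm).mean()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

In practice these functions would be applied after one of the alignments, with per-joint errors pooled over all evaluation images before calling `pck_auc`.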
- Ordering of the joints: The ordering of the 21 joints in our training and evaluation splits follows the same convention as in the MANO model. The 21 joints follow the ordering shown in the image below. The wrist joint is used as the root joint. Be sure to follow this ordering in your submissions!
- Coordinate system: All annotations assume the OpenGL coordinate system, i.e., the hand and objects lie along the negative z-axis in a right-handed coordinate system with its origin at the camera optical center.
- Additional information for evaluation split: We provide the following additional information for the evaluation split images. The participants are free to use them in their submissions.
- 2D bounding box for hand: For each image in the evaluation split, a tight bounding box for the hand is provided.
- Root joint 3D location: For each image in the evaluation split, the 3D location of the root joint is provided. This helps in resolving the scale/depth ambiguity. Note that our evaluation scripts invariably resolve the scale ambiguity through the alignments described earlier.
- Mesh error computation: We use the MANO model to obtain the hand mesh, which contains 778 vertices. As a result, the mesh error metric requires the hand mesh to be obtained from the same MANO model. If the number of vertices in the submission is other than 778, the mesh error metric will return -1.
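The notes above on the coordinate system, joint count and vertex count can be turned into a small pre-submission check. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that your model predicts points in an OpenCV-style camera frame (x right, y down, z forward), which is converted to the OpenGL convention by negating the y and z coordinates.

```python
import numpy as np

NUM_JOINTS = 21          # joints per hand, wrist as root
NUM_MANO_VERTICES = 778  # vertices of the MANO hand mesh


def opencv_to_opengl(points):
    """Convert (N, 3) points from an OpenCV-style frame to the OpenGL frame."""
    # Negating y and z maps (x right, y down, z forward)
    # to (x right, y up, z backward).
    return np.asarray(points, dtype=float) * np.array([1.0, -1.0, -1.0])


def check_prediction(joints, verts):
    """Return a list of problems found in one prediction (empty if none)."""
    joints, verts = np.asarray(joints), np.asarray(verts)
    problems = []
    if joints.shape != (NUM_JOINTS, 3):
        problems.append(f"expected joints of shape (21, 3), got {joints.shape}")
    elif np.any(joints[:, 2] >= 0):
        # In the OpenGL convention the hand lies along the negative z-axis.
        problems.append("joints have non-negative z; wrong coordinate system?")
    if verts.shape != (NUM_MANO_VERTICES, 3):
        problems.append(f"expected vertices of shape (778, 3), got {verts.shape}")
    return problems
```

A check like this does not verify the joint ordering itself, which still has to be matched against the convention described above.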
- C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. In ICCV, 2019.
- C. Zimmermann and T. Brox. Learning to Estimate 3D Hand Pose from Single RGB Images. In ICCV, 2017.
- Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):78, 2017.
- Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (ToG), 36(6):245, 2017.
Terms and Conditions
The dataset is released for academic research only and is free to researchers from educational or research institutes for non-commercial purposes. By downloading the dataset, you agree, unless you have the express permission of the authors, not to redistribute, modify, or commercially use this dataset in any way or form, either partially or entirely. If you use this dataset, please cite the corresponding paper.