Abstract: To address the challenge of reliably acquiring 3D pose information of truss tomatoes for autonomous harvesting robots under severe occlusion and strong light interference in greenhouses, an improved 3D keypoint estimation model named TomatoPose3D was proposed. During training, the model incorporated joint constraints between RGB images and 3D ground-truth keypoints to enhance structural consistency and generalization capability. During inference, the model regressed 3D keypoint coordinates end-to-end from a single RGB image, thereby avoiding localization failures caused by sparse or missing point clouds. Built on the RTMPose3D baseline, the improved model introduced the global structure-aware MobileViT block and the distribution-aware coordinate representation of keypoints (DARK) decoding strategy, improving localization accuracy while maintaining a lightweight architecture. Comparative experiments in greenhouse scenarios indicated that TomatoPose3D improved the PCK@0.05 score by 5.18 and 9.98 percentage points compared with RTMPose3D and SimpleBaseline3D, respectively. Without depth information, the model achieved localization accuracy comparable to that of RGB-D projection-based methods while demonstrating superior robustness. Furthermore, the model was deployed on an industrial-grade embedded platform with TensorRT acceleration, achieving an end-to-end inference speed of 37 frames/s, which met the real-time spatial visual perception requirements of harvesting robots.
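
For readers unfamiliar with the PCK@0.05 metric reported above, the following is a minimal sketch of how such a score is commonly computed for 3D keypoints: a keypoint counts as correct when its Euclidean error is within 0.05 times a per-instance reference length. The abstract does not state which reference length was used, so the `norm_lengths` argument (for example, the diagonal of the ground-truth 3D bounding box) and the function name `pck` are assumptions made for illustration, not the authors' evaluation code.

```python
import math


def pck(pred, gt, norm_lengths, threshold=0.05):
    """Fraction of keypoints whose 3D error is <= threshold * reference length.

    pred, gt: lists of instances, each a list of (x, y, z) keypoints.
    norm_lengths: one reference length per instance (an assumed choice,
        e.g. the 3D bounding-box diagonal of the ground-truth keypoints).
    """
    correct = 0
    total = 0
    for pred_inst, gt_inst, norm in zip(pred, gt, norm_lengths):
        for p, g in zip(pred_inst, gt_inst):
            error = math.dist(p, g)  # Euclidean distance in 3D
            if error <= threshold * norm:
                correct += 1
            total += 1
    # Return 0.0 for an empty evaluation set instead of dividing by zero.
    return correct / total if total else 0.0


# Hypothetical data: one instance, two keypoints, reference length 1.0.
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.01, 0.0, 0.0), (1.2, 0.0, 0.0)]]
score = pck(pred, gt, norm_lengths=[1.0])
print(f"PCK@0.05 = {score:.2f}")
```

In this toy case the first keypoint has an error of 0.01 (within the 0.05 tolerance) and the second an error of about 0.2 (outside it), so half of the keypoints are counted as correct. A gain of 5.18 percentage points on this metric therefore means that roughly five more keypoints per hundred fall inside the tolerance.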