Abstract: In greenhouse environments, fast, high-precision, and low-cost acquisition of scene depth information is crucial for agricultural machine vision systems in tasks such as tomato phenotype analysis, autonomous harvesting, and multimodal joint segmentation. An attention-embedded RGB-to-depth conversion network (RDCN) for monocular depth estimation was proposed to address shortcomings of traditional algorithms, including insufficient feature extraction capability in the encoder, low depth estimation accuracy, and blurred boundaries. First, ResNeXt101 replaced the original ResNet101 backbone network, and feature maps extracted at different levels were integrated into the Laplacian pyramid branches; this emphasized the scale differences among features and enhanced the depth and breadth of feature fusion. To strengthen the model's capacity for capturing global information and contextual interactions, a shuffle attention module (SAM) was introduced, which also mitigated the loss of local detail information caused by downsampling. Second, to address blurred boundaries in the predicted depth maps, a depth refinement module (DRM) was embedded to capture depth variations near object edges in the predicted feature maps. For the study, an RGB-D image acquisition platform for tomatoes was constructed in a solar greenhouse using an Azure Kinect DK depth camera. To ensure dataset diversity, images were collected at different times of day under the varying light intensities of the greenhouse. The training set was then augmented with three methods, horizontal mirroring, random rotation, and color jittering, yielding 8515 aligned RGB-D image pairs of tomatoes. Experimental results showed that, with the shuffle attention module and the depth refinement module, the model predicted depth information accurately in greenhouse scenes. Compared with the baseline model, RDCN reduced the mean relative error, root mean square error, log root mean square error, and log mean error on the test set by 20.5%, 10.3%, 8.3%, and 21.8%, respectively, and improved accuracy under the 1.25, 1.25², and 1.25³ thresholds by 3.2%, 1.2%, and 1.0%, respectively. The depth maps generated by the network were also globally complete and clear, with rich texture detail, exhibiting superior visual quality especially in regions with complex geometry and large depth variations. These results indicate that RDCN can recover high-quality depth information from RGB data in greenhouse environments, providing technical support for monocular-sensor navigation of agricultural machinery in greenhouse scenarios and for the application of depth images in multimodal tasks.
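The encoder change can be made concrete. The sketch below is an assumption rather than the paper's released code: it uses torchvision's resnext101_32x8d as a stand-in ResNeXt101 backbone (the exact variant and pretraining are not stated in the abstract) and pulls feature maps from its four stages, the kind of multi-level features that would feed the Laplacian pyramid branches.

```python
import torch
import torchvision.models as models
from torchvision.models.feature_extraction import create_feature_extractor

# weights=None keeps the example self-contained; in practice an
# ImageNet-pretrained backbone would normally be loaded.
backbone = models.resnext101_32x8d(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={f"layer{i}": f"feat{i}" for i in range(1, 5)}
)

x = torch.randn(1, 3, 352, 704)   # one RGB frame (size is illustrative)
feats = extractor(x)              # stride-4/8/16/32 feature maps
for name, f in feats.items():
    print(name, tuple(f.shape))   # feat1: (1, 256, 88, 176) ... feat4: (1, 2048, 11, 22)
```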
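The abstract names a shuffle attention module but not its internals. Below is a minimal PyTorch sketch in the style of the published SA-Net shuffle attention block (grouped channels, a channel-gating branch and a spatial-gating branch, then a channel shuffle); the group count and the points at which RDCN inserts the module are assumptions.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of an SA-Net-style shuffle attention block (assumed design)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per branch in a group
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))  # channel-gate scale
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))   # channel-gate shift
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))  # spatial-gate scale
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))   # spatial-gate shift
        self.gn = nn.GroupNorm(c, c)          # per-channel normalization
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x0, x1 = x.chunk(2, dim=1)            # channel branch, spatial branch
        s = x0.mean(dim=(2, 3), keepdim=True) # global average pooling
        x0 = x0 * self.sigmoid(self.cw * s + self.cb)   # channel attention
        t = self.gn(x1)
        x1 = x1 * self.sigmoid(self.sw * t + self.sb)   # spatial attention
        out = torch.cat([x0, x1], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, 2)   # mix information across branches

sam = ShuffleAttention(channels=256, groups=8)
y = sam(torch.randn(2, 256, 44, 88))          # output keeps the input shape
```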
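For aligned RGB-D training pairs, the three augmentations have to be applied with care: geometric transforms must hit the RGB image and the depth map identically so pixel alignment is preserved, while color jittering applies to the RGB image only. A torchvision sketch follows; the jitter strengths and rotation range are illustrative assumptions, not the paper's settings.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import ColorJitter

# Illustrative photometric ranges (not from the paper).
jitter = ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)

def augment_pair(rgb, depth, max_deg=5.0):
    """Paired augmentation for one aligned RGB-D sample."""
    if random.random() < 0.5:                 # horizontal mirroring, both images
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)
    angle = random.uniform(-max_deg, max_deg) # random rotation, same angle for both;
    rgb = TF.rotate(rgb, angle)               # rotate() defaults to nearest-neighbor
    depth = TF.rotate(depth, angle)           # interpolation, which avoids mixing depths
    rgb = jitter(rgb)                         # color jittering, RGB only
    return rgb, depth
```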
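The reported metrics are the standard monocular depth-estimation suite. For reference, a NumPy sketch of how they are conventionally computed over valid pixels (the masking convention for invalid sensor readings is an assumption):

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard depth metrics: rel, RMSE, log RMSE, log10, threshold accuracy."""
    mask = gt > eps                            # drop invalid (zero) depth pixels
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)                       # mean relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                       # root mean square error
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))   # log RMSE
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))          # log mean error
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** n) for n in (1, 2, 3)]           # 1.25, 1.25², 1.25³
    return abs_rel, rmse, rmse_log, log10, acc
```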