Abstract: To address the challenges of localization and scene recognition for greenhouse mobile robots in highly dynamic environments with visually similar scenes, a novel scene recognition model based on local feature selection and aggregation was proposed. The model employed a pre-trained vision Transformer (DINOv2) as its backbone to extract local image features and introduced a learnable query-based feature selection and aggregation strategy to generate discriminative global descriptors. Through cross-attention, the model selectively aggregated the most informative local features into compact global representations. Furthermore, a hybrid loss function combining contrastive and triplet learning was applied to optimize the recognition model. A comprehensive greenhouse scene dataset containing 2100 scenes and 25000 images was constructed, covering challenging factors such as illumination variation, viewpoint change, distance scaling, and temporal crop growth. Experimental results demonstrated that the proposed model achieved recall rates of 88.79% (R@1), 96.49% (R@5), and 97.96% (R@10) on the collected dataset, outperforming state-of-the-art scene recognition methods, namely NetVLAD, GeM, CosPlace, EigenPlaces, MixVPR, and SALAD, by 23.70, 19.24, 10.64, 3.30, 3.90, and 0.44 percentage points in R@1, respectively. The model exhibited strong robustness under varying illumination (R@1 fluctuation <5 percentage points), moderate viewpoint changes (93.12% accuracy within 15° deviation), and scale variation (63.94% R@1 at 2× distance). However, performance declined under extreme viewpoint or distance changes and long-term crop growth (61.14% R@1 after 5 days). Real-world validation on a greenhouse mobile robot confirmed the model's practicality, achieving an average recognition rate of 85.88%.
The proposed learnable query-based feature aggregation mechanism, combined with the carefully selected feature extraction backbone, significantly improved recognition accuracy in greenhouse environments. The framework offers a viable technical solution for vision systems on agricultural mobile robots.
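The core mechanism described above, learnable queries that cross-attend to backbone patch tokens and aggregate the most informative ones into a compact global descriptor, can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact architecture: the class name `QueryAggregator`, the number of queries (64), the head count (8), and the 768-dimensional token size (typical of a DINOv2 ViT-B backbone) are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class QueryAggregator(nn.Module):
    """Sketch of learnable query-based feature aggregation.

    A set of learnable query vectors cross-attends to the local (patch)
    features produced by the backbone; the attended outputs are then
    concatenated and L2-normalized into a single global descriptor.
    """

    def __init__(self, dim: int = 768, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable queries, shared across all images (assumed initialization).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend over the local features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (B, N, dim) patch tokens from the frozen backbone.
        B = local_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, Q, dim)
        out, _ = self.attn(q, local_feats, local_feats)       # cross-attention
        out = self.norm(out)                                  # (B, Q, dim)
        g = out.flatten(1)                                    # concat queries
        return nn.functional.normalize(g, dim=-1)             # unit-norm descriptor


# Usage: aggregate 256 hypothetical patch tokens per image into one descriptor.
model = QueryAggregator()
feats = torch.randn(2, 256, 768)
desc = model(feats)   # shape (2, 64 * 768), L2-normalized per row
```

In practice, descriptors like `desc` would be compared by cosine similarity (a dot product, given the normalization) to retrieve the closest reference scene.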