Abstract: Scene recognition can serve as an alternative to spatial positioning in greenhouse environments and is an important function of the visual systems of intelligent agricultural machinery. Because scene recognition paradigms based on feature clustering adapt poorly to greenhouse scenes, which change highly dynamically and are highly similar to one another, a greenhouse scene recognition method based on deep feature aggregation was proposed. The method used a pre-trained vision transformer network to extract local features from scene images. It then exploited the global receptive field of a multi-layer perceptron to fuse these local features while taking their spatial relationships into account, generating a global descriptor for each scene image. The greenhouse scene recognition model was trained with minimization of the multi-similarity loss as the optimization objective. Test results showed that the model's R@1 (top-1 recall rate), R@5, and R@10 reached 78.43%, 89.21%, and 92.47%, respectively, indicating high scene recognition accuracy. The proposed feature mixing method based on the multi-layer perceptron proved effective, improving R@1 by 8.01 percentage points over feature aggregation with pooling operations. The model showed a degree of robustness to changes in lighting: R@1 under strong and weak lighting decreased by no more than 4.00 percentage points compared with normal (medium) lighting. Changes in camera angle and sampling distance also affected recognition performance, with a decline of 6.61 percentage points for angle changes within 20 degrees and a drop of 17.87 percentage points for distance changes within twice the original distance.
Compared with existing scene recognition benchmark methods, including NetVLAD, GeM, Patch-NetVLAD, MultiRes-NetVLAD, and MixVPR, the proposed model improved R@1 by 7.82, 6.59, 3.56, 4.14, and 1.88 percentage points, respectively, demonstrating a significant performance gain on the greenhouse scene recognition task. The global feature aggregation method based on the multi-layer perceptron generated reliable global descriptors for greenhouse scene recognition and exhibited robustness to changes in lighting, viewpoint, distance, and time. The findings provide a technical reference for the design of visual systems for intelligent agricultural machinery.
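
The R@K (top-K recall) metric reported above can be illustrated with a minimal sketch. It assumes that each query image has a set of database indices counted as correct matches and that global descriptors are compared by cosine similarity; the abstract specifies neither, and the names `cosine_similarity`, `recall_at_k`, and `ground_truth` are hypothetical, chosen only for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_desc, db_desc, ground_truth, ks=(1, 5, 10)):
    """Fraction of queries with at least one correct match in the top K.

    query_desc   -- list of query global descriptors
    db_desc      -- list of database global descriptors
    ground_truth -- ground_truth[i] is the set of database indices that
                    count as correct for query i (hypothetical format)
    """
    hits = {k: 0 for k in ks}
    for qi, q in enumerate(query_desc):
        # Rank database entries from most to least similar to the query.
        ranked = sorted(
            range(len(db_desc)),
            key=lambda j: cosine_similarity(q, db_desc[j]),
            reverse=True,
        )
        for k in ks:
            if any(j in ground_truth[qi] for j in ranked[:k]):
                hits[k] += 1
    n = len(query_desc)
    return {k: hits[k] / n for k in ks}


if __name__ == "__main__":
    # Toy data: three database descriptors and two queries.
    database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    queries = [[0.9, 0.1], [0.1, 0.9]]
    truth = [{0}, {2}]
    print(recall_at_k(queries, database, truth, ks=(1, 2)))
```

In the toy data, the first query retrieves its correct match at rank 1 and the second at rank 2, so R@1 is 0.5 and R@2 is 1.0. A percentage-point difference between two methods, as quoted in the abstract, is the plain difference of two such recall values expressed in percent.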