Abstract: Accurate cultivated land remote sensing image segmentation (CLRSIS) is crucial for yield prediction, agricultural management, and national food security. However, it remains challenging due to the high resolution, large size, and variety of remote sensing farmland images, which feature irregular boundaries and complex backgrounds. Convolutional neural networks (CNNs) and Transformers have been widely applied to remote sensing image segmentation, but both have a limited ability to model long-range dependencies, owing to the inherent locality of convolutions and the quadratic computational complexity of self-attention, respectively. To address these limitations and the technical difficulties of CLRSIS, a multi-scale attention visual Mamba U-Net (MSAVM-UNet) model is proposed. The model achieves its performance gains through three novel modules. First, a modified visual state space module (MVSS) adopts a bidirectional selective scanning mechanism, enabling long-range dependency modeling while maintaining linear computational complexity. Second, a channel-aware attention visual state-space module (CAAVSS) enhances the discrimination between cultivated land and background features through dynamic spectral-spatial feature recalibration. Finally, a multi-scale feature aggregation module (MSAA) builds a cross-level feature pyramid to fuse multi-granularity information. Experiments on public cultivated land datasets show that the proposed method significantly outperforms existing methods in both segmentation accuracy and computational efficiency, with average precision and DSC reaching 85.60% and 84.46%, respectively. These results can provide reliable technical support for the precise monitoring of cultivated land in smart agriculture.
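The core idea behind the MVSS module, a linear-time bidirectional selective scan over a flattened token sequence, can be illustrated with a minimal sketch. This is not the paper's implementation; the recurrence `h_t = a_t * h_{t-1} + b_t * x_t` with input-dependent coefficients `a`, `b` and the forward/backward fusion are simplified, hypothetical stand-ins for a full selective state-space layer.

```python
import numpy as np

def selective_scan(x, a, b):
    """One-directional linear-time scan: h_t = a_t * h_{t-1} + b_t * x_t.

    x, a, b: arrays of shape (L, C) -- sequence length L, channels C.
    a plays the role of an input-dependent decay (the "selective" part);
    b gates how much of the current input enters the hidden state.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):          # O(L): one pass, constant work per step
        h = a[t] * h + b[t] * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a, b):
    """Fuse forward and backward scans so every position sees full context."""
    fwd = selective_scan(x, a, b)
    bwd = selective_scan(x[::-1], a[::-1], b[::-1])[::-1]
    return fwd + bwd

# Toy example: 6 "patch tokens" with 4 channels each (hypothetical sizes).
rng = np.random.default_rng(0)
L, C = 6, 4
x = rng.normal(size=(L, C))
a = rng.uniform(0.5, 0.9, size=(L, C))   # assumed decay range for stability
b = rng.uniform(size=(L, C))
y = bidirectional_scan(x, a, b)
print(y.shape)  # (6, 4)
```

Because each direction is a single pass with constant work per token, the cost is O(L) in sequence length, in contrast to the O(L^2) attention matrix of a Transformer, which is the efficiency argument the abstract makes for the Mamba-style backbone.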