Abstract: Data quality issues are a bottleneck in the development of agricultural machinery big data platforms. Existing data cleaning algorithms are poorly suited to real-time agricultural machinery streaming data, which is large-scale, multi-source, heterogeneous, high-dimensional, and strongly correlated in space and time. To address this, the sources and characteristics of abnormal agricultural machinery data in complex environments were analyzed, techniques for detecting and correcting abnormal data were studied, and an online cleaning method for agricultural machinery operation data based on a sliding window mechanism was proposed. The method identified abnormal data using a variance constraint, generated preliminary candidate repair values under the minimum change principle, and, exploiting the temporal correlation of the data, obtained the final repair value through optimization with AR and ARX models. Built on the Flink distributed computing platform, the method was suited to the high data throughput and high concurrency of agricultural machinery data. The algorithm was validated on agricultural machinery operation data from a certain province. The results showed that when the data volume reached 1×10⁵ and the proportion of abnormal data was 5%, the abnormal data recognition rate of the algorithm reached 0.94, and the root mean square error was smaller than that of existing cleaning algorithms. Experiments were designed using the Box-Behnken method, and regression models were obtained through response surface analysis to study the influence of the algorithm parameters on the root mean square error and running time. A binary-coded hybrid genetic algorithm was used to optimize the parameters; with the optimized parameter combination, the root mean square error of the algorithm reached 0.16 and the running time reached 0.13 s.
The proposed cleaning method can provide high-quality data support for real-time processing on agricultural machinery big data platforms.
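
The detect-and-repair loop summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation and does not use Flink. The function name `clean_stream`, the parameters `window`, `max_var`, and `phi`, the use of population variance, the fixed AR(1) coefficient, and the rule for choosing between the AR forecast and the minimum-change candidate are all assumptions made for illustration; the ARX step is omitted.

```python
from collections import deque
from math import sqrt
from statistics import mean


def clean_stream(values, window=5, max_var=1.0, phi=0.5):
    """Clean a numeric stream point by point; return (cleaned, flags).

    Illustrative assumptions (not taken from the paper):
      * the constraint is: population variance of the last `window`
        cleaned points (including the new one) must not exceed `max_var`;
      * the AR model is a fixed-coefficient AR(1) around the window mean;
      * `window` must be at least 2.
    """
    history = deque(maxlen=window - 1)  # most recent cleaned points
    cleaned, flags = [], []
    n = window
    for x in values:
        if len(history) < window - 1:
            # Warm-up: too little context to judge, accept the point as-is.
            history.append(x)
            cleaned.append(x)
            flags.append(False)
            continue
        m = mean(history)
        s = sum((v - m) ** 2 for v in history)
        # Adding a point at distance d from m gives window variance
        # (s + d^2 * (n - 1) / n) / n, so the bound caps d at d_max.
        slack = n * max_var - s
        d_max = sqrt(slack * n / (n - 1)) if slack > 0 else 0.0
        if abs(x - m) <= d_max:
            repaired, flagged = x, False
        else:
            # Minimum-change candidate: the feasible value nearest to x.
            candidate = m + d_max if x > m else m - d_max
            # AR(1) forecast from the previous cleaned point.
            forecast = m + phi * (history[-1] - m)
            # Assumed rule: prefer the forecast when it meets the
            # constraint, otherwise fall back to the candidate.
            repaired = forecast if abs(forecast - m) <= d_max else candidate
            flagged = True
        history.append(repaired)
        cleaned.append(repaired)
        flags.append(flagged)
    return cleaned, flags


if __name__ == "__main__":
    readings = [10.0, 10.2, 9.9, 10.1, 50.0, 10.0]  # made-up sample values
    cleaned, flags = clean_stream(readings)
    print(flags)
    print(cleaned)
```

Repaired values, rather than raw ones, are fed back into the window so that one outlier does not inflate the variance seen by later points. In a streaming deployment the same per-point logic would live inside a keyed, stateful operator, with `history` held as operator state.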