Parallel Forestry Text Classification Technology Based on XGBoost in Spark Framework
Authors: CUI Xiaohui, SHI Dongyu, CHEN Zhibo, XU Fu

Fund Project:

National Natural Science Foundation of China (61772078) and Beijing Forestry University Hot Spot Tracking Project (2018BLRD18)

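    Summary:

    The integration of "Internet Plus" technology with forestry has produced a large volume of forestry-related text awaiting mining, yet research on forestry text classification remains immature. To address this, web crawlers were used to collect forestry-related text from the internet, the classification labels were rebuilt from the resulting corpus, and a parallel XGBoost method based on the Spark computing framework was proposed to classify forestry text. Under cross-validation, the parallel XGBoost classifier achieved an accuracy of 0.9234, with a lowest per-class F1 of 0.8604 and a highest of 0.9984; its training speedup ratios on datasets of 21,000, 42,000 and 84,000 records were 2.13, 3.47 and 3.82, respectively. The results show that the classification model built on these labels adapts well to forestry-related text currently on the internet. The accuracy of the parallel XGBoost algorithm implemented on Spark is significantly higher than that of the parallel versions of four other machine learning algorithms (naive Bayes, GBDT, BP neural network and ELM neural network), its execution efficiency is far higher than that of the single-machine version, and its speedup ratio grows with the data volume, so it can handle real-time, accurate classification of massive forestry text.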
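    The summary reports speedup ratios without stating how they were computed. A minimal sketch, assuming the conventional definition (single-machine training time divided by parallel training time); the function names and the timings in the usage note are hypothetical, and only the three reported ratios come from the text:

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Conventional speedup ratio: serial time divided by parallel time.

    Assumption: the abstract does not define its speedup ratio, so this
    uses the standard definition.
    """
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return serial_seconds / parallel_seconds


def grows_with_data(ratios_by_size: dict) -> bool:
    """Return True if the speedup strictly increases with dataset size."""
    ordered = [ratios_by_size[size] for size in sorted(ratios_by_size)]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


# Speedup ratios reported in the abstract, keyed by number of records.
REPORTED_SPEEDUPS = {21_000: 2.13, 42_000: 3.47, 84_000: 3.82}
```

    For example, a hypothetical job that takes 100 s on one machine and 50 s on the cluster gives `speedup(100.0, 50.0) == 2.0`, and `grows_with_data(REPORTED_SPEEDUPS)` confirms that the reported ratios increase with dataset size.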

    Abstract:

    The integration of computer technology with forestry has produced a large number of forestry texts awaiting analysis. Related research has two shortcomings: the classification labels in existing classification systems are not set scientifically, so classification models struggle to classify texts found on the internet; and classification algorithms are mostly trained in a single-machine environment without considering parallelism, so they cannot handle large-scale classification in practice. It is therefore both practical and urgent to design more scientific classification labels and to classify forestry texts on the Spark framework. Web crawler technology was used to collect forestry-related texts, and the labels were reconstructed with reference to the existing forestry information retrieval system to improve the adaptability of classification models. A parallel XGBoost implementation was then realized on Spark, carrying out training and prediction in the RDD programming model. Under cross-validation, the accuracy of the parallel XGBoost algorithm reached 0.9234; the lowest F1-measure was 0.8604 and the highest was 0.9984. On datasets of 21,000, 42,000 and 84,000 records, the training speedup ratios reached 2.13, 3.47 and 3.82, respectively. The results showed that the new classification labels were more scientific and that the system adapted better to forestry-related texts currently on the internet. The precision and recall of the XGBoost algorithm were significantly better than those of four other Spark-based parallel algorithms (naive Bayes, gradient boosting decision tree, back propagation neural network and extreme learning machine), and it ran more efficiently than the stand-alone version. The speedup ratio also increased with the amount of data, which makes the method useful for real-time, accurate classification of massive forestry texts.
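    The abstract reports accuracy together with per-class F1-measure, precision and recall. As a rough illustration of how these metrics relate, the sketch below computes them from true and predicted labels. It is not the authors' evaluation code; the function names and the label names in the usage note are hypothetical, since the paper's label set is not listed here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, where F1 is the harmonic mean of precision and recall."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the truth but was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores
```

    With the toy labels `y_true = ["policy", "policy", "pest", "pest", "fire"]` and `y_pred = ["policy", "pest", "pest", "pest", "fire"]`, accuracy is 0.8, while the per-class F1 ranges from about 0.667 ("policy") to 1.0 ("fire"). This is why the abstract reports the lowest and highest per-class F1 alongside overall accuracy.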

Cite this article

崔晓晖,师栋瑜,陈志泊,许福.基于Spark框架XGBoost的林业文本并行分类方法研究[J].农业机械学报,2019,50(6):280-287.
CUI Xiaohui, SHI Dongyu, CHEN Zhibo, XU Fu. Parallel Forestry Text Classification Technology Based on XGBoost in Spark Framework[J]. Transactions of the Chinese Society for Agricultural Machinery, 2019, 50(6): 280-287.

History
  • Received: 2019-03-02
  • Published online: 2019-06-10