陈志泊,李钰曼,许福,冯国明,师栋瑜,崔晓晖.基于TextRank和簇过滤的林业文本关键信息抽取研究[J].农业机械学报,2020,51(5):207-214,172.
CHEN Zhibo,LI Yuman,XU Fu,FENG Guoming,SHI Dongyu,CUI Xiaohui.Key Information Extraction of Forestry Text Based on TextRank and Clusters Filtering[J].Transactions of the Chinese Society for Agricultural Machinery,2020,51(5):207-214,172.
Key Information Extraction of Forestry Text Based on TextRank and Clusters Filtering
Received: 2019-12-31
DOI:10.6041/j.issn.1000-1298.2020.05.023
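Chinese keywords: 林业文本 (forestry text); 关键词抽取 (keyword extraction); TextRank; 簇过滤 (cluster filtering); 信息类型 (information type)
Funding: National Natural Science Foundation of China (61772078) and Beijing Forestry University Hot Spot Tracking Project (2018BLRD18)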
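Author  Affiliation
CHEN Zhibo  Beijing Forestry University
LI Yuman  Beijing Forestry University
XU Fu  Beijing Forestry University
FENG Guoming  China United Network Communications Group Co., Ltd.
SHI Dongyu  China Telecom System Integration Co., Ltd.
CUI Xiaohui  Beijing Forestry University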
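Abstract (translated from the Chinese): Obtaining key information from forestry text currently faces two problems: key information is treated mainly as a matter of keywords, so the information type of each word is ignored; and forestry texts on the Internet have no unified descriptive structure, which makes word information types hard to extract. To address this, the paper proposes a method for extracting key information from forestry text based on an improved TextRank and cluster filtering, representing the key information of a text as two parts, "keyword + information type". First, keywords are extracted and vectorized with Word2Vec; TextRank is then improved by building a graph model that fuses word feature values and edge weights, and the stable graphs obtained after iterative convergence are merged and clustered into clusters. Next, a cluster quality evaluation formula is designed to filter the clusters, and TextRank is applied again to form the final cluster set. Finally, the clusters are labeled with information types. For a test text, the information type of a word is obtained by comparing the distance between the keyword vector and the cluster center vectors, and combining the information type with the keyword yields the key information of the text. Experiments on 2000 forestry texts related to forestry policy news gave a final cluster set with a compactness of 0.9680, a separation of 0.0572, and a comprehensive evaluation index of 0.8871. Keywords were manually annotated for 400 of these texts, and the proposed keyword extraction method was compared with six algorithms, including TextRank and TF-IDF. The results show that the proposed method performs well on MRR, Bpref, accuracy, and the comprehensive evaluation index, indicating an advantage in extracting keywords from forestry text.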
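The abstract does not give the improved TextRank formula, so the following is only a minimal sketch of a generic weighted TextRank, not the authors' method. It assumes an undirected word graph with positive edge weights and, as one possible way to fuse word feature values into the graph, uses them to bias the restart term. The function name `weighted_textrank`, its parameters, and the sample words are hypothetical.

```python
from collections import defaultdict


def weighted_textrank(edges, node_weight=None, damping=0.85, tol=1e-6, max_iter=200):
    """Score words on an undirected weighted graph (illustrative sketch).

    edges: dict mapping (word_a, word_b) -> positive edge weight.
    node_weight: optional dict word -> positive feature value; here it only
        biases the restart term, which is an assumption of this sketch.
    """
    # Build a symmetric adjacency map from the edge list.
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w
    nodes = sorted(neighbors)
    if not nodes:
        return {}

    # Normalize word feature values into a restart distribution.
    node_weight = node_weight or {}
    raw = {n: node_weight.get(n, 1.0) for n in nodes}
    total = sum(raw.values())
    prior = {n: raw[n] / total for n in nodes}

    # Total edge weight leaving each node, used to normalize transitions.
    out_sum = {n: sum(neighbors[n].values()) for n in nodes}

    score = dict(prior)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            incoming = sum(score[m] * w / out_sum[m] for m, w in neighbors[n].items())
            new[n] = (1 - damping) * prior[n] + damping * incoming
        delta = max(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:  # stop once the scores are stable
            break
    return score


# Made-up co-occurrence weights, for illustration only.
sample_edges = {("forest", "policy"): 2.0, ("forest", "fire"): 1.0, ("policy", "law"): 1.0}
sample_scores = weighted_textrank(sample_edges)
```

With these sample edges, the two well-connected words end up with higher scores than the two words that have a single light edge, and the scores sum to one.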
CHEN Zhibo  LI Yuman  XU Fu  FENG Guoming  SHI Dongyu  CUI Xiaohui
Beijing Forestry University; China United Network Communications Group Co., Ltd.; China Telecom System Integration Co., Ltd.
Key Words: forestry text; keyword extraction; TextRank; cluster filtering; information types
Abstract: Obtaining key information from forestry text faces two main problems. First, key information is mainly considered from the perspective of keywords, and the information types of words are neglected. Second, forestry text on the Internet has no unified description structure, which makes it difficult to extract word information types. A method for extracting key information from forestry text, based on an improved TextRank and cluster filtering, was proposed; it represents key information as "keywords + information types". The method had six steps. The first step was to extract the text keywords according to the keyword extraction formula. The second step was to represent the keywords as vectors with Word2Vec. The third step was to improve the TextRank algorithm, mainly by merging the word features and introducing edge weights to construct the graph model of the text. The fourth step was to obtain stable graph structures through iterative convergence and then merge them to form clusters; cluster quality was evaluated from three aspects: the uniformity of element distribution, the size of the clusters, and the universality of the clusters. The fifth step was to form the final cluster set in combination with the TextRank algorithm. The final step was to label the final clusters with information types. The experiments used 2000 forestry texts related to forestry policies and news. For the final cluster set, the compactness was 0.9680, the separation was 0.0572, and the F1-measure was 0.8871, which showed that the information types of the clusters could be clearly labeled. For the keywords of a text, the information type was obtained by calculating the cosine similarity between each keyword vector and the cluster centers. The combination of keywords and information types constituted the key information of a forestry text. In addition, 400 texts were manually labeled, and the method was compared with six algorithms, including TextRank and TF-IDF; it achieved better results in MRR, Bpref, accuracy, and F1-measure, which showed that the method had advantages in extracting forestry text keywords.
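The information-type lookup described in the abstract can be sketched as a nearest-center search under cosine similarity. This is an illustration under stated assumptions, not the paper's implementation: the function names, the three-dimensional vectors, and the labels "policy" and "location" are all made up, and real Word2Vec vectors would have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_information_type(keyword_vec, cluster_centers):
    """Return (label, similarity) for the cluster center most similar to the keyword."""
    best_label, best_sim = None, -2.0  # cosine similarity is never below -1
    for label, center in cluster_centers.items():
        sim = cosine_similarity(keyword_vec, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Hypothetical labeled cluster centers and a hypothetical keyword vector.
centers = {"policy": [0.9, 0.1, 0.0], "location": [0.0, 0.2, 0.9]}
label, sim = assign_information_type([0.8, 0.3, 0.1], centers)
```

Pairing the returned label with the keyword gives one "keyword + information type" item in the sense the abstract describes.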

