Text Word Segmentation of Livestock and Poultry Diseases Based on BERT-BiLSTM-CRF Model
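Funding:

Yunnan Province Major Science and Technology Special Program (202102AE090039), Beijing Academy of Agriculture and Forestry Sciences Innovation Capacity Building Special Project (KJCX20230204), and Beijing Digital Agriculture Innovation Team Building Project (BAIC10-2023)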


    Abstract:

    Diagnosing, preventing, and controlling livestock and poultry diseases is of great significance for the healthy development of animal husbandry in China. This study used natural language processing to improve word segmentation of livestock and poultry disease texts, with the aim of improving disease diagnosis. Such texts are scarce and contain many out-of-vocabulary words, such as disease names and phrases. To address these problems, a BERT-BiLSTM-CRF word segmentation model combined with dictionary matching was proposed. Taking sheep diseases as the research object, a text dataset of common diseases was constructed and combined with the general-purpose PKU corpus. The BERT (Bidirectional encoder representation from transformers) pre-trained language model produced vector representations of the text; a bidirectional long short-term memory network (BiLSTM) extracted contextual semantic features; and a conditional random field (CRF) output the globally optimal label sequence. On this basis, a livestock and poultry disease domain dictionary was added after the CRF layer to correct the segmentation by dictionary matching, which reduced ambiguous segmentation caused by disease names and phrases and further improved segmentation accuracy. Experimental results showed that the BERT-BiLSTM-CRF model combined with dictionary matching achieved an F1 value of 96.38% on the sheep common disease text dataset, which was 11.01, 10.62, 8.3, and 0.72 percentage points higher than that of the jieba segmenter, the BiLSTM-Softmax model, the BiLSTM-CRF model, and the proposed model without dictionary matching, respectively, verifying the effectiveness of the method.
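The dictionary-matching correction applied after the CRF layer can be illustrated with a minimal sketch. The abstract does not specify the exact correction rule, so the sketch assumes forward maximum matching over the sentence: every dictionary term found in the text is forced to be a single word, and the model's other cut points are kept. The function name `correct_with_dictionary`, the example sentence, and the dictionary entry are hypothetical and are not taken from the paper.

```python
def correct_with_dictionary(tokens, dictionary, max_len=None):
    """Post-correct a model segmentation with a domain dictionary.

    Assumption (not from the paper): terms are located by forward maximum
    matching, and each matched term overrides the model's cuts inside it.
    """
    text = "".join(tokens)
    if max_len is None:
        max_len = max((len(term) for term in dictionary), default=0)

    # Offsets where the model ended a word.
    cuts = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)

    i = 0
    while i < len(text):
        matched = 0
        # Try the longest candidate first; single characters need no fix.
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in dictionary:
                matched = length
                break
        if matched:
            # Force word boundaries at both ends of the term and
            # remove any model cut that falls inside it.
            cuts.add(i)
            cuts.add(i + matched)
            cuts -= set(range(i + 1, i + matched))
            i += matched
        else:
            i += 1

    # Rebuild the word list from the corrected cut offsets.
    words, start = [], 0
    for end in sorted(cuts):
        if end > start:
            words.append(text[start:end])
            start = end
    return words


# Hypothetical model output that wrongly splits a disease name.
model_output = ["羊", "口蹄", "疫", "的", "症状"]
domain_dictionary = {"口蹄疫"}  # illustrative entry (foot-and-mouth disease)
print(correct_with_dictionary(model_output, domain_dictionary))
```

Working on cut offsets rather than on token lists lets the same routine repair a term that the model split in the middle and a term that straddles two model tokens.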
Compared with a single corpus, the mixed corpus, which combines the general-purpose PKU corpus with the sheep common disease text dataset, allowed the model to accurately segment both the professional terms of livestock and poultry diseases and the common words in disease texts. The F1 values on both the general corpus and the disease text dataset exceeded 95%, indicating good generalization ability. The method can be used for word segmentation of livestock and poultry disease texts.
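For readers unfamiliar with how an F1 value is obtained for word segmentation, the following is a minimal sketch of a common word-level scoring scheme: a predicted word counts as correct only if both its start and end offsets match a gold-standard word. The abstract does not state the paper's exact evaluation protocol, so this scheme is an assumption, and the function names and example sentence are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_sentences, pred_sentences):
    """Word-level precision, recall, and F1 over paired segmentations."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted words must cover the same text")
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)

    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sentence: the prediction splits one disease name in two.
gold = [["羊", "口蹄疫", "的", "症状"]]
pred = [["羊", "口蹄", "疫", "的", "症状"]]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this example, three of the five predicted words match the four gold words, so one wrongly split term lowers both precision and recall.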

Cite this article

YU Ligen, GUO Xiaoli, ZHAO Hongtao, YANG Gan, ZHANG Jun, LI Qifeng. Text Word Segmentation of Livestock and Poultry Diseases Based on BERT-BiLSTM-CRF Model[J]. Transactions of the Chinese Society for Agricultural Machinery, 2024, 55(2): 287-294.

History
  • Received: 2023-11-13
  • Published online: 2024-02-10