Fast Storage and Indexing Method of Big Data in Forest Ecological Station

Fund Project:

Fundamental Research Funds for the Central Universities (BLX201923) and National Natural Science Foundation of China (32071775)

Abstract:

Forest ecological stations generate large volumes of unstructured data (images, video, GIS data) and structured data (ecological indicators), and both suffer from low storage efficiency and poor retrieval performance. To address this, a big data storage framework for forest ecological stations was proposed based on Hadoop and HBase. Building on the framework, the business process for storing forest ecological data was defined, and the core technologies of the forest ecological big data platform were optimized in five respects: (1) a pre-partitioning algorithm was designed so that data are evenly distributed across the cluster; (2) the RowKey was designed around the characteristics of ecological data to enable fast retrieval; (3) because native HBase does not support multi-condition queries, an ElasticSearch index shard placement strategy was designed based on an evaluation of index data and server performance, and ElasticSearch-based secondary (non-primary-key) indexing was then used to optimize multi-condition retrieval from the HBase ecological database; (4) to ease storage of the massive number of small images produced by the stations, a packing-and-merging strategy was proposed based on the correlation of data by station and time; (5) GIS data were parsed so that they could be stored efficiently.

Experiments were carried out to verify these designs. The proposed ElasticSearch shard placement strategy reduced query time by an average of 20 ms compared with the default shard strategy, and by an average of 20 ms compared with a strategy based on changing the ElasticSearch scoring policy. With 1×10⁸ structured records, the system's retrieval time was 1.045 s, a 3.99-fold improvement in retrieval speed over native HBase. With 1×10⁷ unstructured records, the station- and time-correlated packing strategy for small images was 1.15 times as efficient as SequenceFile-based merging and 1.79 times as efficient as native HBase. With 1×10⁴ concurrent users, queries per second after optimization were 1.88 times the pre-optimization value, throughput per second was 1.74 times the pre-optimization value, and system response time was 69.5% lower. These results indicate that the proposed scheme clearly improves cluster load balancing, retrieval efficiency for massive structured and unstructured data, and system throughput, providing a theoretical basis and technical implementation for the storage and management of forest ecological data.
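
The abstract does not give the actual RowKey layout or the pre-partitioning algorithm, so the following is only a minimal sketch of the general idea behind points (1) and (2), under stated assumptions: a salt prefix derived from the station ID spreads stations across pre-split regions, and a fixed-width UTC timestamp keeps one station's readings in chronological order. The field order, the `|` separator, the region count of 8, and all function names are hypothetical and are not taken from the paper.

```python
import bisect
import hashlib
from datetime import datetime, timezone

NUM_REGIONS = 8  # assumed number of pre-split regions (hypothetical)


def salt_for(station_id: str, num_regions: int = NUM_REGIONS) -> int:
    """Map a station ID to a stable bucket in [0, num_regions)."""
    digest = hashlib.md5(station_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_regions


def make_rowkey(station_id: str, indicator: str, ts: datetime,
                num_regions: int = NUM_REGIONS) -> bytes:
    """Build a key of the assumed form salt|station|indicator|timestamp.

    Assumes station_id and indicator contain no '|' and ts is timezone-aware.
    """
    salt = salt_for(station_id, num_regions)
    # Fixed-width timestamp: byte order equals chronological order.
    stamp = ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{salt:02d}|{station_id}|{indicator}|{stamp}".encode("utf-8")


def split_keys(num_regions: int = NUM_REGIONS) -> list[bytes]:
    """Region boundaries for pre-splitting: one boundary per salt value > 0."""
    return [f"{i:02d}".encode("utf-8") for i in range(1, num_regions)]


def region_of(rowkey: bytes, splits: list[bytes]) -> int:
    """Index of the region whose [start, end) key range contains rowkey."""
    return bisect.bisect_right(splits, rowkey)
```

With this layout, all rows of one station share a salt and therefore fall in one region, while a prefix scan on `salt|station|indicator|` returns that indicator's readings in time order. The boundaries from `split_keys()` would be the values passed to the table-creation call of whichever HBase client is in use. Whether the paper salts by station, by hash of the full key, or by another scheme is not stated in the abstract.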

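Cite this article

WANG Xinyang, JIA Xiangyu, CHEN Zhibo, CUI Xiaohui, XU Fu. Fast Storage and Indexing Method of Big Data in Forest Ecological Station[J]. Transactions of the Chinese Society for Agricultural Machinery, 2021, 52(8): 195-204, 212.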

History
  • Received: 2021-02-08
  • Published online: 2021-08-10