Journal: BMC Bioinformatics

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics


Abstract

Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial to achieving good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability. We provide evidence that the use of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
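
For readers unfamiliar with the case study, the following is a minimal single-machine sketch of k-mer counting arranged in a map/reduce style: each partition of sequences is counted independently and the partial counts are then merged. It is not FastKmer's implementation and does not use Spark; the function names (`kmers`, `count_partition`, `merge_counts`, `count_kmers`) and the toy input are assumptions made purely for illustration, and the workload-balancing module described in the abstract is not modelled here.

```python
from collections import Counter
from functools import reduce


def kmers(sequence, k):
    """Yield every length-k substring (k-mer) of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_partition(sequences, k):
    """'Map' step: count the k-mers found in one partition of the input."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts


def merge_counts(left, right):
    """'Reduce' step: combine the partial counts of two partitions."""
    return left + right


def count_kmers(partitions, k):
    """Count k-mers per partition, then merge all partial results."""
    partial = [count_partition(p, k) for p in partitions]
    return reduce(merge_counts, partial, Counter())


# Toy input (hypothetical): two partitions of short DNA strings.
partitions = [["ACGTAC"], ["GTACG", "AC"]]
totals = count_kmers(partitions, 2)
```

In a distributed setting the merge step is where skew can appear, since frequent k-mers concentrate work on whichever node aggregates them; the abstract states that FastKmer addresses this with a dedicated balancing module.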
