Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

Umberto Ferraro Petrillo; Mara Sorella; Giuseppe Cattaneo; Raffaele Giancarlo; Simona E. Rombo

Abstract

Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, driven by the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of either efficiency or effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show that the choice of parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial for achieving good performance, especially on very large amounts of data.

We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is a module that balances the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while fully exploiting the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability.

We provide evidence that the use of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
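
To make the case study concrete: k-mer counting tallies every length-k substring across a set of sequences. The following is a minimal single-machine Python sketch of that idea and of how uneven aggregation load (data skew) can arise. It is not FastKmer's Spark implementation. The function names `count_kmers` and `partition_load` and the toy character-sum hash are assumptions made purely for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across the input sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k; sequences shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def partition_load(counts, num_partitions):
    """Sum k-mer occurrences per partition under a toy deterministic hash.

    The hash (sum of character codes modulo num_partitions) is a hypothetical
    stand-in for whatever partitioner a distributed framework would apply.
    """
    load = [0] * num_partitions
    for kmer, n in counts.items():
        load[sum(map(ord, kmer)) % num_partitions] += n
    return load


if __name__ == "__main__":
    counts = count_kmers(["ACGTAC"], 2)
    print(dict(counts))
    print(partition_load(counts, 2))
```

In this sketch, a frequent k-mer sends all of its occurrences to one partition, so per-partition totals can differ. That is the kind of imbalance a workload-balancing module would aim to even out.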

Bibliographic details

Source: BMC Bioinformatics, 2019, Issue S4, 14 pages
Original format: PDF
Keywords: Apache Spark; Distributed computing; Performance evaluation; k-mer counting

