Journal: Algorithms for Molecular Biology

Separating metagenomic short reads into genomes via clustering



Abstract

Background: The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. On the other hand, they produce shorter reads, which makes it harder to separate reads from different species. Among the existing computational tools for metagenomic analysis, there are similarity-based methods, which align reads against reference databases, and composition-based methods, which cluster reads by composition patterns (i.e., frequencies of short words or l-mers). Similarity-based methods cannot classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on the predicted abundances of the sequenced genomes. However, it does not separate reads from genomes of similar abundance levels.

Results: In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers belong to a unique genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and was then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.

Conclusions: Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
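The two ideas the abstract relies on, that l-mers become genome-specific as l grows and that l-mers can be grouped into clusters, can be illustrated with a small Python sketch. This is not the authors' algorithm: the function names (`lmers`, `shared_fraction`, `cluster_lmers`), the toy sequences, and the rule used for clustering (l-mers that co-occur in the same read are joined with a union-find structure) are all assumptions made for illustration only.

```python
from collections import defaultdict


def lmers(seq: str, l: int) -> set[str]:
    """Return the set of distinct l-mers (substrings of length l) in seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def shared_fraction(genome_a: str, genome_b: str, l: int) -> float:
    """Fraction of the distinct l-mers of both sequences that occur in both."""
    a, b = lmers(genome_a, l), lmers(genome_b, l)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_lmers(reads: list[str], l: int) -> list[set[str]]:
    """Toy clustering (an assumption, not the paper's phase one):
    l-mers that co-occur in at least one read end up in the same cluster."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for read in reads:
        words = sorted(lmers(read, l))
        for w in words:
            parent.setdefault(w, w)
        # Join every l-mer of this read with the read's first l-mer.
        for w in words[1:]:
            parent[find(w)] = find(words[0])

    clusters: dict[str, set[str]] = defaultdict(set)
    for w in parent:
        clusters[find(w)].add(w)
    return list(clusters.values())


if __name__ == "__main__":
    # Two hypothetical toy "genomes" that share a short prefix.
    genome_a, genome_b = "ACGTACGT", "ACGTTGCA"
    for l in (1, 4, 8):
        print(l, shared_fraction(genome_a, genome_b, l))  # 1.0, 0.125, 0.0

    # Hypothetical reads drawn from two unrelated toy sequences.
    reads = ["AAAAC", "AAACC", "GGGGT", "GGGTT"]
    print(cluster_lmers(reads, 4))
```

On the toy sequences the shared fraction falls from 1.0 at l = 1 to 0.0 at l = 8, which mirrors the observation in the abstract on a very small scale. Real data would also need handling for reverse complements, sequencing errors, and repeats shared between genomes, none of which this sketch attempts.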
