Separating metagenomic short reads into genomes via clustering

Olga Tanaseichuk; James Borneman; Tao Jiang

Abstract

Background: The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. On the other hand, they produce shorter reads, which makes separating reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods, which align reads against reference databases, and composition-based methods, which cluster reads by composition patterns (i.e., frequencies of short words, or l-mers). Similarity-based methods cannot classify reads from unknown species without close references, and such reads constitute the majority. Since composition patterns are preserved only in sufficiently large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on the predicted abundances of the sequenced genomes. However, it does not separate reads from genomes of similar abundance levels.

Results: In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers belong to a single genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.

Conclusions: Our tests on a large number of simulated metagenomic datasets involving species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
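
The observation behind the first phase, that most l-mers belong to a single genome once l is large enough, can be illustrated with a small Python sketch. This is not the authors' algorithm or their software: the helper names (`lmers`, `shared_fraction`, `random_genome`) and the two random toy "genomes" are assumptions made for illustration only, and the sketch ignores reverse complements, sequencing errors, and real genomic repeats.

```python
import random


def lmers(seq, l):
    """Return the set of all length-l substrings (l-mers) of seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def shared_fraction(genome_a, genome_b, l):
    """Fraction of distinct l-mers that occur in both genomes (Jaccard index)."""
    a, b = lmers(genome_a, l), lmers(genome_b, l)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def random_genome(length, seed):
    """Hypothetical stand-in for a genome: a uniformly random ACGT string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


genome_a = random_genome(2000, seed=1)
genome_b = random_genome(2000, seed=2)

# Print the shared fraction for increasing word lengths l.
for l in (4, 8, 12, 20):
    print(l, round(shared_fraction(genome_a, genome_b, l), 4))
```

For short words almost every l-mer appears in both toy genomes, while for long words the shared fraction drops toward zero. Real genomes behave less cleanly because related species share conserved regions and repeats, which is the case the second, merging phase described above has to deal with.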

Bibliographic record

Source: Algorithms for Molecular Biology, 2012, Issue 1
Authors: Olga Tanaseichuk; James Borneman; Tao Jiang
Original format: PDF
Classification (Chinese Library Classification): Molecular biology
