
CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment



Abstract

High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Typically, bioinformatic analysis of microbial diversity starts with pre-processing, followed by clustering the 16S rRNA reads into a relatively small number of operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology allows CLUSTOM-CLOUD to handle larger datasets and to scale computationally better than its predecessor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets, using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours when using 20, 30, and 40 nodes, respectively. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and that it is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm.
CLUSTOM-CLOUD is written in JAVA and is freely available at .
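
The scalability claim can be checked with simple arithmetic on the reported Amazon EC2 timings. The sketch below is illustrative only: it takes the approximate running times from the abstract (one million reads; about 20, 14, and 11 hours on 20, 30, and 40 nodes) and computes speedup and parallel efficiency relative to the 20-node run. The function name `relative_scaling` and the choice of the 20-node run as baseline are assumptions made for this example and are not part of CLUSTOM-CLOUD.

```python
# Approximate running times (hours) reported in the abstract for
# one million reads on Amazon EC2, keyed by number of nodes.
reported_hours = {20: 20.0, 30: 14.0, 40: 11.0}


def relative_scaling(times, baseline_nodes):
    """Return (nodes, speedup, efficiency) tuples relative to a baseline run.

    speedup    = baseline time / time at this node count
    efficiency = speedup / (nodes / baseline_nodes), i.e. the fraction of
                 ideal linear scaling that was actually achieved
    """
    base_time = times[baseline_nodes]
    rows = []
    for nodes in sorted(times):
        speedup = base_time / times[nodes]
        ideal_speedup = nodes / baseline_nodes
        rows.append((nodes, speedup, speedup / ideal_speedup))
    return rows


if __name__ == "__main__":
    # Hypothetical choice: treat the 20-node run as the baseline.
    for nodes, speedup, efficiency in relative_scaling(reported_hours, 20):
        print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Because the reported times are rounded, the resulting figures (roughly 95% efficiency at 30 nodes and 91% at 40 nodes, relative to 20 nodes) are only indicative, but they are consistent with the near-linear scaling the abstract describes.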
