CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

Abstract

High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Typically, analysis of microbial diversity starts with pre-processing, followed by clustering the 16S rRNA reads into a relatively small number of operational taxonomic units (OTUs). OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology gives CLUSTOM-CLOUD a greater capacity for large datasets and better computational scalability than its ancestor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours using 20, 30, and 40 nodes, respectively. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm.
CLUSTOM-CLOUD is written in JAVA and is freely available at .
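
The scalability claim can be checked with simple arithmetic on the Amazon EC2 timings quoted in the abstract. The sketch below is only an illustration: the function name `relative_scaling` and the choice of the 20-node run as the baseline are assumptions made here, not part of the paper, and the input times are the approximate figures reported above.

```python
# Approximate wall-clock hours reported in the abstract for one million reads
# on Amazon EC2, keyed by node count.
reported = {20: 20.0, 30: 14.0, 40: 11.0}


def relative_scaling(times, baseline_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / observed time
    efficiency = speedup / (nodes / baseline nodes), i.e. the fraction of
                 ideal linear scaling that was achieved.
    This helper is hypothetical and exists only for this illustration.
    """
    base_time = times[baseline_nodes]
    result = {}
    for nodes, hours in sorted(times.items()):
        speedup = base_time / hours
        ideal = nodes / baseline_nodes
        result[nodes] = (speedup, speedup / ideal)
    return result


scaling = relative_scaling(reported, baseline_nodes=20)
for nodes, (speedup, efficiency) in scaling.items():
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these rounded inputs, doubling the node count from 20 to 40 gives roughly a 1.8x speedup, or about 91% of ideal linear scaling, which is consistent with the abstract's description of the system as scalable.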

Bibliographic details

Journal: other
Authors: Jeongsu Oh; Chi-Hwan Choi; Min-Kyu Park; Byung Kwon Kim; Kyuin Hwang; Sang-Heon Lee; Soon Gyu Hong; Arshan Nasir; Wan-Sup Cho; Kyung Mo Kim
Year (volume), issue: -1(11), 3
Year: -1
Pages: e0151064
Total pages: 20
Original format: PDF

Similar literature

1. Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data [J]. Manal Helal, Fanrong Kong, Sharon C. A. Chen. PLoS One. 2011, Issue 6

2. Evaluation of the Integrated Database Network System (IDNS) SmartGene Software for Analysis of 16S rRNA Gene Sequences for Identification of Nocardia Species [J]. Patricia S. Conville, Patrick R. Murray, Adrian M. Zelazny. Journal of Clinical Microbiology. 2010, Issue 8

3. Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data [J]. Shikha Agnihotry, Aditya N Sarangi, Rakesh Aggarwal. The Indian journal of medical research. 2020, Issue 1

4. Computer-Assisted Bacterial Identification Using 16S rRNA Sequence Data [C]. G. Gonzalez, M. Doud, K. Mathee. Southern Biomedical Engineering Conference. 2009

5. Qualitative assessments and computational techniques for the studies of microbial diversity based on terminal restriction fragment length polymorphism (T-RFLP) of 16S and 18S rRNA gene sequences [D]. Shyu, Conrad. 2006

6. Evaluation of the Integrated Database Network System (IDNS) SmartGene Software for Analysis of 16S rRNA Gene Sequences for Identification of Nocardia Species [O]. Patricia S. Conville, Patrick R. Murray, Adrian M. Zelazny. 2010

7. CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment [O]. Jeongsu Oh, Chi-Hwan Choi, Min-Kyu Park. 2016
