BMC Bioinformatics

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

Abstract

Background: The NCBI Conserved Domain Database (CDD) is a collection of multiple sequence alignments of protein domains that are at various stages of manual curation into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.

Results: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures, starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time, this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.

Conclusions: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.
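The abstract does not spell out the statistical model, so the following is only an illustrative sketch of the general idea of "pattern residues that most distinguish a subgroup", not the authors' heuristic/MCMC procedure. It swaps in a simpler, well-known measure: the Jensen-Shannon divergence between residue frequencies in a subgroup (foreground) and in the remaining sequences (background), computed per alignment column. All function names, the pseudocount value, and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(sequences, col, pseudocount=1.0):
    """Residue frequencies at one alignment column, with additive smoothing.

    Gap or unknown characters are skipped (an assumption of this sketch).
    """
    counts = Counter(seq[col] for seq in sequences if seq[col] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two residue distributions."""
    m = {aa: 0.5 * (p[aa] + q[aa]) for aa in AMINO_ACIDS}

    def kl(a, b):
        return sum(a[aa] * math.log2(a[aa] / b[aa]) for aa in AMINO_ACIDS)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_divergent_columns(foreground, background):
    """Return (column, score) pairs sorted from most to least divergent."""
    n_cols = len(foreground[0])
    scores = []
    for col in range(n_cols):
        p = column_frequencies(foreground, col)
        q = column_frequencies(background, col)
        scores.append((col, js_divergence(p, q)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy aligned sequences (hypothetical, for illustration only).
foreground = ["ACDK", "ACDK", "ACEK"]  # the subgroup of interest
background = ["ACDR", "GCDR", "ACDR"]  # other related sequences
ranking = rank_divergent_columns(foreground, background)
```

In this toy alignment the last column (K in the subgroup, R elsewhere) ranks first, while the fully conserved second column scores zero. A real analysis would also need sequence weighting and a significance criterion, which this sketch omits.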
