PLoS Computational Biology

Cross-Over between Discrete and Continuous Protein Structure Space: Insights into Automatic Classification and Networks of Protein Structures


Abstract

Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test under which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, which requires that similarity of A with B and of B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find that such violations present a well-defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchical classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications.
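The idea of counting transitivity violations can be illustrated with a small sketch. The abstract does not give the formula of the proposed measure, so the definition below is an assumption made only for illustration: at a given similarity threshold, it counts the triples in which A is similar to B and B is similar to C, and reports the fraction in which A and C are not similar. The function name, the threshold values, and the toy matrix are all hypothetical.

```python
from itertools import combinations


def transitivity_violation_fraction(sim, threshold):
    """Fraction of connected triples that violate transitivity.

    A triple (a, b, c) centred on b is 'connected' when both sim[a][b]
    and sim[b][c] reach the threshold. It is 'violated' when sim[a][c]
    falls below the threshold. This is an illustrative definition, not
    the measure defined in the paper.
    """
    n = len(sim)
    connected = 0
    violated = 0
    for b in range(n):
        # Elements considered similar to b at this threshold.
        neighbours = [a for a in range(n) if a != b and sim[a][b] >= threshold]
        for a, c in combinations(neighbours, 2):
            connected += 1
            if sim[a][c] < threshold:
                violated += 1
    return violated / connected if connected else 0.0


# Toy symmetric similarity matrix for four hypothetical domains.
toy_sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.5],
    [0.8, 0.85, 1.0, 0.1],
    [0.1, 0.5, 0.1, 1.0],
]

for t in (0.7, 0.4):
    print(t, transitivity_violation_fraction(toy_sim, t))
```

On this toy matrix the fraction is 0 at the high threshold and 0.4 at the low one, which mirrors qualitatively the two regimes described above: domain 3 resembles domain 1 only weakly, so lowering the threshold links it to a group whose other members it does not resemble.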
As a domain set, we selected a consensus set of 2,890 domains decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent than expert ones with respect to the structure similarity measure, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try to use alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations, where clustering is more arbitrary. Consistent with this, the agreement between SCOP and CATH at the fold level was lower than the agreement of each with the automatic classification obtained using average linkage (for SCOP) or single linkage (for CATH) as the clustering algorithm. The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at http://ub.cbm.uam.es/research/ProtNet.php
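Average linkage, the clustering algorithm named above, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes a generic precomputed similarity matrix (standing in for the contact overlap scores, whose normalization the abstract does not specify) and stops merging at a hypothetical threshold `stop_below`, which plays the role of a cut such as the cross-over point.

```python
def average_linkage(sim, stop_below):
    """Agglomerative clustering with average linkage on a similarity matrix.

    At each step the two clusters with the highest mean pairwise
    similarity are merged. Merging stops once that best mean similarity
    drops below `stop_below` (a hypothetical cut-off parameter).
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def mean_similarity(c1, c2):
        # Average of all cross-cluster pairwise similarities.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = mean_similarity(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        if s < stop_below:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy symmetric similarity matrix for four hypothetical domains.
toy_scores = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.5],
    [0.8, 0.85, 1.0, 0.1],
    [0.1, 0.5, 0.1, 1.0],
]

print(average_linkage(toy_scores, stop_below=0.7))
print(average_linkage(toy_scores, stop_below=0.2))
```

With the stricter cut-off the three mutually similar domains form one cluster and the fourth stays apart. With the looser cut-off everything is joined into a single cluster, a toy analogue of folds being completely joined at low similarity.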
