
Reply to Holmes and Duchêne, “Can Sequence Phylogenies Safely Infer the Origin of the Global Virome?”: Deep Phylogenetic Analysis of RNA Viruses Is Highly Challenging but Not Meaningless



Abstract

REPLY

In their Letter to the Editor of mBio, written in response to our recent article on evolution of the global RNA virome (1), Holmes and Duchêne submit that the extreme sequence divergence between the RNA-dependent RNA polymerases (RdRps) makes it impossible to infer deep relationships between RNA viruses from any type of sequence analysis. We certainly agree with Holmes and Duchêne that extreme caution is due in the analysis and interpretation of deep phylogenies, and in particular, that alignment quality is central to our ability to resolve long-distance evolutionary relationships. If the alignment is largely wrong (i.e., does not align homologous protein sites) or noninformative (i.e., cannot be used to distinguish between alternative histories), it is of no utility for phylogenetic reconstruction. Moreover, even a correct and informative alignment does not guarantee correct phylogenetic reconstruction due to the technical limitations of the software, systematic biases of the available evolutionary models, and the fundamentally random nature of sequence divergence. Therefore, formal phylogenetic analysis should be accompanied by careful consideration of the associated biological data and examined in terms of the implications of the respective evolutionary scenarios.

Exactly where the boundary lies between an alignment that is suitable for phylogenetic reconstruction and one that is “highly unlikely to be accurate” is far from an easy question. In the ideal situation (high sequence similarity, random homoplasy), one might need as few as O(log k) informative sites to resolve a tree of k sequences (2). With real-life data, it is critical that sequence similarity, even if extremely low between the most distant sequences, changes according to a pattern consistent with the tree structure.
Fortunately, the structure of proteins with their nearly invariant functional sites, strongly conserved structural core, variable bulk, and extremely fluid interface surfaces naturally provides such an essential pattern, with each class of sites in the scale of evolutionary rates allowing for good resolution at the appropriate range of distances. Furthermore, the very definition of a “random” site is not a trivial matter. An alignment site that contains all 20 amino acids might appear completely random, but in fact, its validity and utility greatly depend on the amino acid distribution pattern. An obvious hypothetical example is a site where 110 sequences have one amino acid, 110 sequences have another amino acid, and the remaining 18 sequences each have different amino acids. Such a site would contain a strong bipartition signal, and if the other positions of the discordant sequences show affinity with one of these two groups, it would be highly informative for tree reconstruction.

The alignment of 228 RNA-dependent RNA polymerases (RdRps) from RNA viruses and 10 reverse transcriptases (RTs) that was employed in our work (1) to construct the global tree of RNA viruses does indeed push the envelope of usable sequence similarity. As Holmes and Duchêne note, there are no invariant sites, no sites without gaps, more than 96% of the alignment columns contain more than 50% of gaps, and where sites are aligned, the similarity is low (the median distance between RTs and RdRps is 5.0 substitutions per site as estimated by PhyML). However, some of these metrics, although correctly calculated, do not give the full picture of the alignment properties. Although as indicated above, only 441 sites contain less than 50% of gaps in an alignment of the total length of 12,200, the median length of the RdRp core is 497 amino acids, so that actually, 89% of a typical sequence is part of a reasonable alignment.
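The hypothetical site and the gap statistics quoted above can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the residue letters assigned to the two large groups are arbitrary, and all function and variable names are illustrative assumptions. Only the counts (110, 110, 18, 441, 12,200, and 497) come from the text.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def hypothetical_site():
    """Column from the text: 110 x one residue, 110 x another, 18 singletons.

    Which letters play which role is an arbitrary choice for illustration.
    """
    first, second, *rest = AMINO_ACIDS
    return [first] * 110 + [second] * 110 + list(rest)


def shannon_entropy_bits(column):
    """Shannon entropy (in bits) of the residue distribution in a column."""
    counts = Counter(column)
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def top_two_fraction(column):
    """Fraction of sequences covered by the two most common residues."""
    top = Counter(column).most_common(2)
    return sum(count for _, count in top) / len(column)


column = hypothetical_site()
print("sequences:", len(column), "distinct residues:", len(set(column)))
print("entropy (bits):", shannon_entropy_bits(column),
      "vs uniform:", math.log2(len(AMINO_ACIDS)))
print("covered by two dominant residues:", top_two_fraction(column))

# Gap statistics quoted in the text.
low_gap_columns = 441        # columns with less than 50% gaps
alignment_length = 12_200    # total alignment columns
median_core_length = 497     # median RdRp core length (amino acids)

print("low-gap share of all columns:", low_gap_columns / alignment_length)
print("low-gap columns per median core:", low_gap_columns / median_core_length)
```

Under these assumptions the column uses all 20 amino acids, yet its entropy stays well below that of a uniformly random site because two residues account for 220 of the 238 sequences. The ratio 441/497 reproduces the roughly 89% figure, while 441/12,200 corresponds to the "more than 96%" of columns that are mostly gaps.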
The plot of the conservation (alignment column homogeneity) and gap content shows multiple, sharp peaks of relatively high conservation and low gap content. Moreover, these regions correspond to well-known motifs that are conserved among the RdRps, across the evolutionary distance of more than five substitutions per site, on average (Fig. 1). Although this level of conservation might appear insufficient to capture the deepest relationships between the RNA viruses, one should keep in mind that, at the deepest level, there are few major clades to resolve (according to our analysis, the RT and five branches of RdRps). The alignment statistics rapidly improve at the shallower levels: even within each major branch, the clade-specific conservation is readily apparent (Table 1).

FIG 1 Sequence conservation profile along the core alignment of the RdRps and RTs. The homogeneity metric is based on the BLOSUM62 scores between the consensus amino acid and the actual amino acids in the alignment column and is scaled from 1 (all residues are the same) to 0 (the score is not different from the random expectation). The fraction of gaps is computed using sequence weights (6). The amino acids conserved in five prominent motifs
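A per-column homogeneity score and a weighted gap fraction of the kind described in the figure caption could be sketched as follows. The exact formula is not given in this text, so the scaling used here (observed mean score against the consensus, rescaled between a random expectation and the consensus self-score) is an assumption. The `toy_score` function is a hypothetical stand-in for a BLOSUM62 lookup, and the uniform background is a simplification.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical stand-in for a BLOSUM62 lookup: +4 match, -1 mismatch."""
    return 4 if a == b else -1


def weighted_gap_fraction(column, weights):
    """Share of the total sequence weight that falls on gap characters."""
    gapped = sum(w for residue, w in zip(column, weights) if residue == "-")
    return gapped / sum(weights)


def homogeneity(column, score=toy_score, background=AMINO_ACIDS):
    """Scale a column's conservation to [0, 1] (assumed formula).

    1 means every residue equals the consensus; 0 means the mean score
    against the consensus is no better than for residues drawn uniformly
    from `background`.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    consensus = Counter(residues).most_common(1)[0][0]
    observed = sum(score(consensus, r) for r in residues) / len(residues)
    expected = sum(score(consensus, b) for b in background) / len(background)
    best = score(consensus, consensus)
    if best <= expected:
        return 0.0
    return max(0.0, min(1.0, (observed - expected) / (best - expected)))


conserved = list("AAAA")
scrambled = list(AMINO_ACIDS)
print(homogeneity(conserved), homogeneity(scrambled))
print(weighted_gap_fraction(["A", "-"], [3.0, 1.0]))
```

To use real substitution scores, pass a function that looks up BLOSUM62 values as the `score` argument. A background based on observed amino acid frequencies would be a natural refinement of the uniform one assumed here.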
