首页> 美国卫生研究院文献>Algorithms for Molecular Biology : AMB >On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
【2h】

On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences

机译:关于c-max-tolerance图中的最大集团及其在分子序列聚类中的应用

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n3 + out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n2 log n + out). Furthermore, we present another algorithm which is much easier to implement, and though theoretically slower than the first one, is still running in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets.
机译:给定n个本地对齐序列的集合S,将其划分为非常相似的序列组以促进后续计算(例如系统发育树的生成)是必要的前提条件。本文介绍了一种新的聚类方法,该方法将S划分为子集,以使子集内每对序列的重叠至少是两个序列长度的给定百分比c。我们表明,可以将这个问题简化为在一种特殊的最大公差图中找到所有最大派系,我们称之为c-max-tolerance图。之前我们已经证明,在O(n 3 + out)中可以有效地找到一般最大公差图中的所有最大派系。在这里,使用一种新型的扫描线算法,我们证明了对c-max-tolerance图的限制产生了更好的O(n 2 log n + out)运行时间。此外,我们提出了另一种算法,该算法易于实现,尽管理论上比第一种算法慢,但仍在多项式时间内运行。然后,我们根据所选的c值,对c-max-tolerance图中的所有最大集团的数量和结构进行实验分析。我们将简单的算法应用于人工和生物数据,并且表明该实现比众所周知的应用程序Cliquer快得多。通过引入一种使用所有最大派系的集合进行分区的新启发式方法,我们最终表明,计算出的分区为生物数据集提供了合理的聚类。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
代理获取

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号