Clustering homologous proteins is an important task in functional genomics. Homologous proteins often share common functions, so transferring annotations from homologues of known function is an efficient way to predict the function of unannotated proteins. We use a modularity-based method called CD to group homologous proteins. The method employs a global heuristic search strategy to find the partition of a weighted adjacency graph with the largest modularity. The weighted adjacency graph is constructed by applying a sigmoidal transformation to the pairwise sequence similarities between all protein sequences in a given dataset. The method has been extensively tested on several subsets drawn from the superfamily level of the SCOP (Structural Classification of Proteins) database, where some homologous proteins have very low sequence similarity. Compared with the widely used method MCL, we observe that the number of clusters obtained by CD is closer to the number of superfamilies in the dataset, that the F-measure given by CD is 10% better than that of MCL on average, and that CD is more tolerant of noise in the sequence similarities. Our results indicate that CD is ideally suited to clustering homologous proteins when sequence similarity is low.
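To make the two ingredients named above concrete, the following is a minimal sketch in Python of a sigmoidal transformation of pairwise similarity scores into edge weights, followed by the standard Newman modularity of a candidate partition on the resulting weighted graph. It is an illustration only, not the CD implementation: the function names, the sigmoid parameters (`midpoint`, `steepness`), and the toy scores are all assumptions, and the global heuristic search over partitions is not shown.

```python
import math


def sigmoid_weight(score, midpoint=50.0, steepness=0.1):
    """Map a raw similarity score to an edge weight in (0, 1).

    The midpoint and steepness values are illustrative assumptions,
    not parameters taken from the source.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))


def build_weighted_adjacency(n, scores, midpoint=50.0, steepness=0.1):
    """Build a symmetric n x n weight matrix from pairwise scores.

    `scores` maps a pair (i, j) with i < j to a raw similarity score.
    """
    adjacency = [[0.0] * n for _ in range(n)]
    for (i, j), score in scores.items():
        weight = sigmoid_weight(score, midpoint, steepness)
        adjacency[i][j] = weight
        adjacency[j][i] = weight
    return adjacency


def modularity(adjacency, labels):
    """Newman modularity of a partition of a weighted undirected graph.

    Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)
    """
    n = len(adjacency)
    strength = [sum(row) for row in adjacency]  # weighted degree k_i
    two_m = sum(strength)                       # twice the total edge weight
    if two_m == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += adjacency[i][j] - strength[i] * strength[j] / two_m
    return total / two_m


# Hypothetical toy data: six proteins in two groups of three, with high
# scores inside each group and low scores between groups.
toy_scores = {}
for i in range(6):
    for j in range(i + 1, 6):
        same_group = (i < 3) == (j < 3)
        toy_scores[(i, j)] = 90.0 if same_group else 10.0

toy_adjacency = build_weighted_adjacency(6, toy_scores)
q_grouped = modularity(toy_adjacency, [0, 0, 0, 1, 1, 1])
q_mixed = modularity(toy_adjacency, [0, 1, 0, 1, 0, 1])
q_single = modularity(toy_adjacency, [0, 0, 0, 0, 0, 0])
print(q_grouped, q_mixed, q_single)
```

In this toy setting the partition that matches the two groups scores a higher modularity than one that mixes them, which is the quantity a modularity-maximizing search would try to increase. The sigmoid compresses very high and very low scores toward 1 and 0, so small fluctuations at the extremes change the edge weights only slightly.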