Published in: Artificial Intelligence Review: An International Science and Engineering Journal

An effective web document clustering algorithm based on bisection and merge

Abstract

To cluster web documents that all share the same name entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms proved ineffective for clustering such documents. Our investigation found that clustering these web pages is more complicated because (1) the number of clusters (known as the ground truth) is larger than the two or three clusters typical of general clustering problems, and (2) the cluster sizes in the data set follow extremely skewed distributions. To overcome these problems, we propose an effective clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to deal with the skewed distribution of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
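
The abstract names the ingredients (bisection, merge, normalized cuts of a similarity graph G) but not the exact procedure, so the following is only a minimal sketch of how such a scheme could look, not the authors' algorithm. It assumes a symmetric, non-negative similarity matrix `W`, uses the sign of the Fiedler vector for each bisection, and relies on two hypothetical thresholds, `split_below` and `merge_above`, that do not come from the paper. The helper `toy_similarity` is also purely illustrative.

```python
import numpy as np

def normalized_cut(W, mask):
    """Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B); mask selects side A."""
    cut = W[mask][:, ~mask].sum()
    if cut == 0:  # nothing is cut (this also covers an empty side)
        return 0.0
    return cut / W[mask].sum() + cut / W[~mask].sum()

def spectral_bisect(W):
    """Two-way split by the sign of the Fiedler vector of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return vecs[:, 1] * d_inv_sqrt > 0

def merge_clusters(W, clusters, merge_above=1.0):
    """Merge step: join two clusters when cutting them apart would be expensive."""
    clusters = list(clusters)
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                idx = np.concatenate([clusters[i], clusters[j]])
                side = np.arange(len(idx)) < len(clusters[i])
                # Ncut of the pair, measured on the subgraph they induce
                if normalized_cut(W[np.ix_(idx, idx)], side) > merge_above:
                    clusters[i] = idx
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

def bisect_and_merge(W, split_below=0.5, merge_above=1.0):
    """Bisect while the cut is cheap, then merge fragments that belong together.

    Both thresholds are assumptions made for this sketch, not values from the paper.
    """
    clusters, stack = [], [np.arange(len(W))]
    while stack:
        idx = stack.pop()
        if len(idx) < 2:
            clusters.append(idx)
            continue
        sub = W[np.ix_(idx, idx)]
        side = spectral_bisect(sub)
        trivial = side.all() or not side.any()
        if trivial or normalized_cut(sub, side) >= split_below:
            clusters.append(idx)  # splitting further would cut through a dense group
        else:
            stack += [idx[side], idx[~side]]
    return merge_clusters(W, clusters, merge_above)

def toy_similarity(sizes, within=1.0, between=0.01):
    """Illustrative block-structured similarity matrix with skewed cluster sizes."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    W = np.where(labels[:, None] == labels[None, :], within, between)
    np.fill_diagonal(W, 0.0)
    return W

clusters = bisect_and_merge(toy_similarity([6, 3, 2]))
print([sorted(c.tolist()) for c in clusters])
```

Because every split is accepted or rejected from the normalized cut of the local subgraph, a small cluster can be separated without forcing the large one to be divided as well, which is one plausible way to cope with skewed cluster sizes. In practice the thresholds would have to be tuned on real document similarities.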