
Distributed document clustering and cluster summarization in peer-to-peer environments.



Abstract

This thesis addresses challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpreting clustering results. Another important challenge, driven by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming increasingly distributed.

The proposed keyphrase extraction algorithm, called CorePhrase, efficiently extracts and scores candidate keyphrases from a document cluster. It models a document collection as a graph on which graph mining is leveraged to extract frequent and significant phrases, which are then used to label the clusters. Results show that CorePhrase extracts keyphrases relevant to the documents in a cluster with very high accuracy. Although the algorithm can summarize centralized clusters, it is specifically employed within distributed clustering both to boost distributed clustering accuracy and to provide summaries for distributed clusters.

The first method for distributed document clustering, called collaborative peer-to-peer document clustering, models the nodes of a peer-to-peer network as collaborating nodes with the goal of improving the quality of each node's local clustering solution. This is achieved through the exchange of local cluster summaries between peers, followed by the recommendation of documents to be merged into remote clusters.
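The idea of scoring phrases that recur across the documents of a cluster can be illustrated with a minimal sketch. This is not the thesis's CorePhrase algorithm (which uses a document index graph); it approximates the same intent with plain n-gram counting, and the scoring formula (document frequency × term frequency × phrase length) is an illustrative assumption, not the author's.

```python
from collections import Counter

def extract_keyphrases(cluster_docs, max_len=3, top_k=5):
    """Rank candidate phrases (word n-grams) that recur across the
    documents of one cluster. A crude stand-in for CorePhrase:
    phrases shared by more documents, occurring more often, and
    spanning more words score higher."""
    df = Counter()  # df[phrase] = number of documents containing phrase
    tf = Counter()  # tf[phrase] = total occurrences across the cluster
    for doc in cluster_docs:
        words = doc.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                tf[phrase] += 1
                seen.add(phrase)
        df.update(seen)
    # keep phrases shared by at least two documents; the scoring
    # formula here is illustrative, not the thesis's actual measure
    scored = {p: df[p] * tf[p] * len(p.split())
              for p in df if df[p] >= 2}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

docs = [
    "distributed document clustering in peer to peer networks",
    "document clustering quality in distributed networks",
    "summaries of distributed document clustering results",
]
print(extract_keyphrases(docs))
```

On this toy cluster, the multi-word phrase "document clustering" outranks most single words because it is both frequent and longer, which is the behavior a cluster-labeling keyphrase extractor needs.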
Results on large sets of distributed document collections show that: (i) this collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with more nodes generally achieve greater improvement in clustering after collaboration, relative to their initial clustering, but tend to reach lower absolute clustering quality than networks with fewer nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration has little effect on improving clustering quality.

The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims to produce a single clustering solution across the whole network. It specifically addresses scalability with network size, and consequently the complexity of distributed clustering, by modeling the problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes in single-level networks; (ii) substantial speedup can be achieved by making the hierarchy taller, but at the expense of clustering quality, which degrades toward the top of the hierarchy; (iii) in networks that grow arbitrarily, data becomes more fragmented across neighborhoods, causing poor centroid generation; this suggests that the number of nodes should not grow beyond a certain level without a corresponding increase in data set size; and (iv) distributed cluster summarization can produce summaries comparable in accuracy to those produced by centralized summarization.

We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously.
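The collaborative exchange described above — a peer receives a remote cluster's summary and recommends local documents that fit it better than their current cluster — can be sketched as follows. The function names, the dict-based term-weight representation, and the `margin` parameter are illustrative assumptions; this is not the thesis's actual protocol.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(local_docs, local_centroid, remote_centroid, margin=0.0):
    """Recommend local documents that match a peer's remote cluster
    summary (centroid) better than their current local cluster.
    These would be candidates for merging into the remote cluster."""
    return [d for d in local_docs
            if cosine(d, remote_centroid) > cosine(d, local_centroid) + margin]

# illustrative centroids and document (term -> weight); values invented
local_centroid = {"sports": 1.0, "game": 0.8}
remote_centroid = {"finance": 1.0, "market": 0.8}
doc = {"market": 1.0, "finance": 0.5}
print(recommend([doc], local_centroid, remote_centroid))
```

Exchanging only centroids keeps the communication cost per peer proportional to the number of clusters rather than the number of documents, which is what makes summary exchange attractive in a peer-to-peer setting.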
We also introduce two solutions to the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation. The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections. Achieving the same results with current methodologies requires centralizing the data first, which is sometimes not feasible.

Bibliographic Details

  • Author

    Hammouda, Khaled M.

  • Affiliation

    University of Waterloo (Canada).

  • Degree grantor University of Waterloo (Canada).
  • Subject Engineering System Science.
  • Degree Ph.D.
  • Year 2007
  • Pages 201 p.
  • Total pages 201
  • Format PDF
  • Language eng
  • CLC classification
  • Keywords

