
Distributed document clustering and cluster summarization in peer-to-peer environments.



Abstract

This thesis addresses challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpreting clustering results. Another important challenge, driven by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming increasingly distributed.

The proposed keyphrase extraction algorithm, called CorePhrase, efficiently extracts and scores candidate keyphrases from a document cluster. It models a document collection as a graph on which graph mining is leveraged to extract frequent and significant phrases, which are then used to label the clusters. Results show that CorePhrase extracts keyphrases relevant to the documents in a cluster with very high accuracy. Although the algorithm can summarize centralized clusters, it is specifically employed within distributed clustering both to boost distributed clustering accuracy and to provide summaries for distributed clusters.

The first method for distributed document clustering, called collaborative peer-to-peer document clustering, models the nodes of a peer-to-peer network as collaborating nodes with the goal of improving the quality of each node's local clustering solution. This is achieved through the exchange of local cluster summaries between peers, followed by the recommendation of documents to be merged into remote clusters.
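The idea of scoring phrases that recur across the documents of a cluster can be illustrated with a minimal sketch. This is not the thesis's CorePhrase algorithm (which uses a document index graph); it approximates the same intent with plain n-gram counting, and the scoring formula (document frequency × term frequency × phrase length) is an illustrative assumption, not the author's.

```python
from collections import Counter

def extract_keyphrases(cluster_docs, max_len=3, top_k=5):
    """Rank candidate phrases (word n-grams) that recur across the
    documents of one cluster. A crude stand-in for CorePhrase:
    phrases shared by more documents, occurring more often, and
    spanning more words score higher."""
    df = Counter()  # df[phrase] = number of documents containing phrase
    tf = Counter()  # tf[phrase] = total occurrences across the cluster
    for doc in cluster_docs:
        words = doc.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                tf[phrase] += 1
                seen.add(phrase)
        df.update(seen)
    # keep phrases shared by at least two documents; the scoring
    # formula here is illustrative, not the thesis's actual measure
    scored = {p: df[p] * tf[p] * len(p.split())
              for p in df if df[p] >= 2}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

docs = [
    "distributed document clustering in peer to peer networks",
    "document clustering quality in distributed networks",
    "summaries of distributed document clustering results",
]
print(extract_keyphrases(docs))
```

On this toy cluster, the multi-word phrase "document clustering" outranks most single words because it is both frequent and longer, which is the behavior a cluster-labeling keyphrase extractor needs.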
Results on large sets of distributed document collections show that: (i) this collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with more nodes generally achieve greater improvement in clustering after collaboration, relative to their initial clustering, but tend to reach lower absolute clustering quality than networks with fewer nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration has little effect on improving clustering quality.

The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims to produce a single clustering solution across the whole network. It specifically addresses scalability with network size, and consequently the complexity of distributed clustering, by modeling the problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes in single-level networks; (ii) substantial speedup can be achieved by making the hierarchy taller, but at the expense of clustering quality, which degrades toward the top of the hierarchy; (iii) in networks that grow arbitrarily, data becomes more fragmented across neighborhoods, causing poor centroid generation; this suggests that the number of nodes should not grow beyond a certain level without a corresponding increase in data set size; and (iv) distributed cluster summarization can produce summaries comparable in accuracy to those produced by centralized summarization.

We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously.
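The collaborative exchange described above — a peer receives a remote cluster's summary and recommends local documents that fit it better than their current cluster — can be sketched as follows. The function names, the dict-based term-weight representation, and the `margin` parameter are illustrative assumptions; this is not the thesis's actual protocol.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(local_docs, local_centroid, remote_centroid, margin=0.0):
    """Recommend local documents that match a peer's remote cluster
    summary (centroid) better than their current local cluster.
    These would be candidates for merging into the remote cluster."""
    return [d for d in local_docs
            if cosine(d, remote_centroid) > cosine(d, local_centroid) + margin]

# illustrative centroids and document (term -> weight); values invented
local_centroid = {"sports": 1.0, "game": 0.8}
remote_centroid = {"finance": 1.0, "market": 0.8}
doc = {"market": 1.0, "finance": 0.5}
print(recommend([doc], local_centroid, remote_centroid))
```

Exchanging only centroids keeps the communication cost per peer proportional to the number of clusters rather than the number of documents, which is what makes summary exchange attractive in a peer-to-peer setting.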
We also introduce two solutions to the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation. The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections. Achieving the same results with current methodologies requires centralizing the data first, which is sometimes not feasible.

Bibliographic Details

  • Author

    Hammouda, Khaled M.

  • Affiliation

    University of Waterloo (Canada).

  • Degree grantor University of Waterloo (Canada).
  • Subject Engineering System Science.
  • Degree Ph.D.
  • Year 2007
  • Pages 201 p.
  • Total pages 201
  • Format PDF
  • Language eng
  • CLC classification
  • Keywords

