Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods.

Abstract

Finding the best way to use external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of a means of representing a document with external/domain knowledge integrated into it. Graphs are powerful and versatile tools, useful in many subfields of science and engineering because they illustrate complicated problems simply. However, the graph-based approach to knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system that incorporates semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents: explicit linkage, such as hyperlinks and citation links; implicit linkage, such as coauthor links and co-citation links; and pseudo linkage, such as similarity links. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. To support these new approaches, I develop two automated algorithms that extract multiword phrases and ontological concepts, respectively. Evaluations on a news collection, a web dataset, and biomedical literature demonstrate the effectiveness of the proposed methods.

In the document clustering experiments, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but is also more efficient. The MRF-based algorithm significantly improves on spherical k-means and model-based k-means clustering on the datasets that contain explicit or implicit linkage, and the Wikipedia knowledge-based clustering also improves on clustering based on document content alone. On the task of topic analysis, the proposed graph representation, subgraph detection, and graph ranking algorithm can effectively identify corpus-level and cluster-level topic terms.
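
As a rough illustration of what ranking terms on a graph can look like, the sketch below builds a term co-occurrence graph and scores its nodes with a weighted PageRank computed by power iteration. This is an assumption made for illustration only: the abstract does not specify the thesis's graph construction or ranking algorithm, and the function names (`build_cooccurrence_graph`, `rank_terms`), the document-level co-occurrence weighting, and the sample terms are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Build an undirected weighted term graph (hypothetical construction).

    Each distinct term is a node; an edge weight counts the number of
    documents in which the two terms appear together.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Score terms with weighted PageRank via power iteration.

    This stands in for the unspecified graph ranking step; it is not
    the algorithm from the thesis.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    score = {n: 1.0 / len(nodes) for n in nodes}
    # Every node was created by an edge, so its total weight is positive.
    total_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbor m passes on a share of its score proportional
            # to the weight of the edge between m and n.
            incoming = sum(
                score[m] * w / total_weight[m] for m, w in graph[n].items()
            )
            new_score[n] = (1.0 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


if __name__ == "__main__":
    docs = [
        ["gene", "protein", "cell"],
        ["gene", "protein"],
        ["gene", "cell"],
        ["protein", "binding"],
    ]
    scores = rank_terms(build_cooccurrence_graph(docs))
    for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term}: {value:.3f}")
```

Running the same ranking on the terms of a single cluster rather than the whole corpus would give a cluster-level analogue, again only as an approximation of the idea described above.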
