Incorporating semantic and syntactic information into document representation for document clustering.

Abstract

Document clustering is a widely used strategy in information retrieval and text data mining. Traditional document clustering systems represent each document as a bag of independent words. In this project, we propose enriching the document representation with semantic and syntactic information, which is identified by performing semantic and syntactic analysis on the raw text. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic and syntactic information can improve the performance of our document clustering system on most of our data sets, and that a statistically significant improvement can be achieved when both kinds of information are combined. Our experiments with compound words show that using compound words alone does not improve clustering performance on our data sets. When compound words are combined with the original single words, the combined feature set performs slightly better on most data sets, but this improvement is not statistically significant. To select the best clustering algorithm for our document clustering system, we compare several widely used clustering algorithms. Although the bisecting K-means method has advantages when working with large data sets, a traditional hierarchical clustering algorithm still achieves the best performance on our small data sets.
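The abstract does not spell out how the enriched representation is built, so the following is only a minimal sketch of the general idea: a bag of single words extended with compound-word features and then clustered hierarchically. It rests on assumptions that are not from the thesis: compound words are approximated by adjacent word pairs, similarity is cosine similarity over raw counts, and the hierarchical method is average-link agglomerative clustering. The function names `features`, `cosine`, and `hierarchical_clusters` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def features(text, use_compounds=True):
    """Bag of single words, optionally enriched with adjacent-word pairs."""
    words = text.lower().split()
    bag = Counter(words)
    if use_compounds:
        # Assumption: a "compound word" is approximated by any adjacent pair.
        bag.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return bag


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hierarchical_clusters(vectors, k):
    """Average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


# Toy documents (hypothetical, not from the thesis data sets).
docs = [
    "neural network training data",
    "training a neural network",
    "stock market prices fell",
    "market prices and stock trading",
]
vectors = [features(d) for d in docs]
clusters = hierarchical_clusters(vectors, k=2)
print(clusters)
```

Passing `use_compounds=False` to `features` gives the plain bag-of-words baseline, so the two representations can be compared on the same documents. The pairwise linkage recomputation is quadratic per merge, which is acceptable only for small collections like the ones the abstract mentions.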

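Bibliographic record

  • Author: Wang, Yong
  • Author affiliation: Mississippi State University
  • Degree-granting institution: Mississippi State University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2005
  • Pages: 134 p.
  • Total pages: 134
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Automation technology, computer technology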
