
Incorporating semantic and syntactic information into document representation for document clustering.

Abstract

Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, a document is represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic and syntactic information, which is identified by performing semantic and syntactic analysis on the raw text. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic and syntactic information can improve the performance of our document clustering system for most of our data sets, and that a statistically significant improvement can be achieved when both kinds of information are combined. Our experiments with compound words show that using only compound words does not improve the clustering performance for our data sets. When the compound words are combined with the original single words, the combined feature set performs slightly better for most data sets, but this improvement is not statistically significant. To select the best clustering algorithm for our document clustering system, we compare several widely used clustering algorithms. Although the bisecting K-means method has advantages when working with large data sets, a traditional hierarchical clustering algorithm still achieves the best performance for our small data sets.
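
The idea of combining single words with compound words and then clustering hierarchically can be illustrated with a minimal sketch. It is not the thesis's implementation: adjacent word pairs are used here only as a crude stand-in for real compound-word detection, average-link merging on cosine similarity is one assumed choice of hierarchical algorithm, and the function names and toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def features(text, use_compounds=True):
    """Bag-of-words counts, optionally extended with adjacent-word pairs.

    Assumption: adjacent pairs approximate compound words; a real system
    would use semantic or syntactic analysis to find them.
    """
    words = text.lower().split()
    counts = Counter(words)
    if use_compounds:
        counts.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def agglomerative(vectors, k):
    """Average-link clustering: merge the most similar pair until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(cosine(vectors[i], vectors[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy documents (hypothetical), two per topic.
docs = [
    "neural network training data",
    "training a neural network",
    "stock market prices fall",
    "stock market prices rise",
]
vectors = [features(d) for d in docs]
clusters = agglomerative(vectors, k=2)
```

Passing `use_compounds=False` to `features` gives the plain bag-of-words baseline, so the two representations can be compared on the same documents.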


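Bibliographic record

Author: Wang, Yong
Author affiliation: Mississippi State University
Degree-granting institution: Mississippi State University
Discipline: Computer Science
Degree: Ph.D.
Year: 2005
Pages: 134
Original format: PDF
Language of text: English
Chinese Library Classification: Automation technology, computer technology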

Similar literature

1. Cross-Context Semantic Document Exchange Through A Novel Tabular Document Representation Approach [J]. Yang Shuo, Wei Ran. Journal of Information Science and Engineering. 2021, No. 2

2. Representations of the necklace braid group $\mathcal{NB}_n$ of dimension 4 ($n = 2, 3, 4$) [J]. Taher I. Mayassi, Mohammad N. Abdulrahim. Arabian Journal of Mathematics. 2021, No. 2

3. Mining Semantics Structures from Syntactic Structures in Web Document Corpora [J]. Hamid Mousavi, Shi Gao, Deirdre Kerr, et al. International Journal of Semantic Computing. 2014, No. 4

4. Incorporating Semantic and Syntactic Information in Document Representation for Document Clustering [C]. Yong Wang, Julia Hodges. The 9th World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI 2005), vol. 8. 2005

5. Incorporating background knowledge in document clustering [D]. Fodeh, Samah Jamal. 2010

6. Learning Document Semantic Representation with Hybrid Deep Belief Network [O]. Yan Yan, Xu-Cheng Yin, Sujian Li, et al. 2015

7. Document aboutness via sophisticated syntactic and semantic features [O]. Marco Ponza, Paolo Ferragina, Francesco Piccinno. 2017

