Journal: The Computer Journal

A Self-enriching Methodology for Clustering Narrow Domain Short Texts



Abstract

Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods on corpora of this kind. The methodology enriches raw textual data by adding co-related terms from an automatically constructed lexical knowledge resource that is obtained from the target data set itself, not from an external resource. We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the proposed methodology. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which computer scientists may use to evaluate the different features associated with a given textual corpus.
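The self-term expansion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how co-related terms are scored, so this sketch assumes document-level pointwise mutual information (PMI) with a cutoff. The function names `build_cooccurrence_list` and `expand`, the `threshold` parameter, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def build_cooccurrence_list(docs, threshold=0.0):
    """Map each term to the terms it co-occurs with more than chance predicts.

    Assumption of this sketch: relatedness is document-level PMI,
    log2(P(a, b) / (P(a) * P(b))), and a pair is kept when PMI > threshold.
    The resource is built only from `docs`, the corpus to be clustered.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    related = {}
    for (a, b), n_ab in pair_df.items():
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi > threshold:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def expand(doc, related):
    """Return the document followed by the co-related terms of each of its terms."""
    expanded = list(doc)
    for term in doc:
        expanded.extend(sorted(related.get(term, ())))
    return expanded


# Toy corpus of tokenized short texts (hypothetical data).
corpus = [
    ["cluster", "short", "text"],
    ["cluster", "short"],
    ["narrow", "domain"],
    ["narrow", "domain", "text"],
]
related_terms = build_cooccurrence_list(corpus)
enriched_corpus = [expand(doc, related_terms) for doc in corpus]
```

The enriched documents would then be passed to any clustering method in place of the raw ones. Raising `threshold` keeps only strongly associated pairs, which limits how much extra vocabulary overlap the expansion introduces.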
