A Self-enriching Methodology for Clustering Narrow Domain Short Texts

David Pinto; Paolo Rosso; Héctor Jiménez-Salazar

Abstract

Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which computer scientists may use to evaluate the different features associated with a given textual corpus.
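The self-term expansion idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how co-related terms are identified, so the sketch assumes document-level pointwise mutual information (PMI) as the relatedness measure. The function names `build_related_terms` and `expand`, the thresholds `min_pmi` and `min_pair_count`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def build_related_terms(docs, min_pmi=1.0, min_pair_count=2):
    """Build a term -> related-terms map from the corpus itself.

    Assumption: relatedness is document-level PMI; the thresholds are
    illustrative values, not values reported by the authors.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    related = {}
    for (a, b), n_ab in pair_df.items():
        if n_ab < min_pair_count:
            continue
        # PMI = log2(P(a, b) / (P(a) * P(b))), estimated from document counts
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi >= min_pmi:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def expand(tokens, related):
    """Return the original tokens followed by the related terms of each token."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(sorted(related.get(t, ())))
    return expanded


# Toy corpus (hypothetical): four short, already tokenized texts.
docs = [
    "parser grammar treebank".split(),
    "grammar treebank annotation".split(),
    "clustering centroid similarity".split(),
    "clustering similarity measure".split(),
]
related = build_related_terms(docs)
enriched = [expand(d, related) for d in docs]
```

In this toy corpus, "grammar" and "treebank" co-occur in two documents, so each is added wherever the other appears. This raises the term frequencies inside each short text without consulting any external resource. The enriched token lists would then be passed to whatever clustering method is in use.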

Bibliographic details

Source
The Computer Journal | 2011, Issue 7 | pp. 1148-1165 | 18 pages
Authors
David Pinto; Paolo Rosso; Héctor Jiménez-Salazar
Affiliations

Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla, Puebla, Mexico;

Natural Language Engineering Lab., ELiRF, Universidad Politécnica de Valencia, Valencia, Spain;

Information Technologies Dept., Universidad Autónoma Metropolitana, Mexico City, Mexico

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
clustering and analysis of textual data; narrow domain short texts; natural language processing; internet tools;


