PLoS One

Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets



Abstract

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, sheet music, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; however, in this study, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with different numbers of unitary elements as input. Both corpora reveal improvements in accuracy proportional to the increase in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology achieves higher accuracy with incremental usage. A future study will aim to design an iterative system for the proposed methodology.
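The abstract describes the approach only at a high level. As a purely illustrative sketch, and not the authors' published method, the following Python snippet shows one way an unsupervised, n-gram frequency-based chunker could work: adjacent tokens are merged into a chunk whenever their bigram count in the corpus reaches a threshold. The toy corpus, the use of plain bigram counts, and the threshold value are assumptions made for this example.

from collections import Counter
from typing import List

def build_bigram_counts(corpus: List[List[str]]) -> Counter:
    # Count adjacent token pairs (2-grams) over a tokenized corpus; no labels needed.
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts

def chunk_sentence(tokens: List[str], bigram_counts: Counter, threshold: int = 2) -> List[List[str]]:
    # Extend the current chunk when the adjacent pair is frequent enough in the
    # corpus; otherwise start a new chunk.
    if not tokens:
        return []
    chunks = [[tokens[0]]]
    for prev, curr in zip(tokens, tokens[1:]):
        if bigram_counts[(prev, curr)] >= threshold:
            chunks[-1].append(curr)
        else:
            chunks.append([curr])
    return chunks

# Hypothetical toy corpus of tokenized sentences.
corpus = [
    "the stock market fell sharply today".split(),
    "the stock market rallied after the announcement".split(),
    "analysts watched the stock market closely".split(),
]
counts = build_bigram_counts(corpus)
print(chunk_sentence("the stock market fell".split(), counts))
# -> [['the', 'stock', 'market'], ['fell']], since "the stock" and "stock market"
#    occur often enough to be merged into one idiomatic unit.

In the setting reported in the abstract, such frequency statistics would be accumulated over the Wall Street Journal and Twitter corpora, which is consistent with the observation that accuracy improves as more tokens are processed.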
