PLoS One

Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets



Abstract

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, sheet music, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; however, in this study, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with different numbers of unitary elements as input. Both corpora reveal improvements in accuracy proportional to the increase in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology achieves higher accuracy with incremental usage. A future study will aim to design an iterative system for the proposed methodology.
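The abstract describes the approach only at a high level. As a purely illustrative sketch, and not the authors' published method, the following Python snippet shows one way an unsupervised, n-gram frequency-based chunker could work: adjacent tokens are merged into a chunk whenever their bigram count in the corpus reaches a threshold. The toy corpus, the use of plain bigram counts, and the threshold value are assumptions made for this example.

from collections import Counter
from typing import List

def build_bigram_counts(corpus: List[List[str]]) -> Counter:
    # Count adjacent token pairs (2-grams) over a tokenized corpus; no labels needed.
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts

def chunk_sentence(tokens: List[str], bigram_counts: Counter, threshold: int = 2) -> List[List[str]]:
    # Extend the current chunk when the adjacent pair is frequent enough in the
    # corpus; otherwise start a new chunk.
    if not tokens:
        return []
    chunks = [[tokens[0]]]
    for prev, curr in zip(tokens, tokens[1:]):
        if bigram_counts[(prev, curr)] >= threshold:
            chunks[-1].append(curr)
        else:
            chunks.append([curr])
    return chunks

# Hypothetical toy corpus of tokenized sentences.
corpus = [
    "the stock market fell sharply today".split(),
    "the stock market rallied after the announcement".split(),
    "analysts watched the stock market closely".split(),
]
counts = build_bigram_counts(corpus)
print(chunk_sentence("the stock market fell".split(), counts))
# -> [['the', 'stock', 'market'], ['fell']], since "the stock" and "stock market"
#    occur often enough to be merged into one idiomatic unit.

In the setting reported in the abstract, such frequency statistics would be accumulated over the Wall Street Journal and Twitter corpora, which is consistent with the observation that accuracy improves as more tokens are processed.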
