Graph-Based Analysis of Similarities between Word Frequency Distributions of Various Corpora for Complex Word Identification

Abstract

Complex word identification (CWI) is a fundamental task in educational NLP and applied linguistics: it identifies the complex words in a text for applications such as text simplification. Recent studies have independently reported that word-frequency features from some uncommon corpora improve CWI accuracy when combined with those from a general corpus, which suggests that they can serve as adjustments to a general corpus. However, although previous studies have analyzed similarity values between each pair of corpora, the significance of a given similarity within the entire set of corpora is unclear. This complicates the analysis of combinations of general and uncommon corpora aimed at improving CWI accuracy, so the search for effective types of corpora would have to be exhaustive. To contribute to a better understanding and a non-exhaustive search, this paper proposes a novel graph-based analysis method. We first calculate various similarities among the word frequency distributions of various corpora in an unsupervised manner. Subsequently, we regard each similarity as a weighted graph and analyze the importance of a pair of corpora, or an edge, within the entire graph structure. Our experiments show that the analysis method can successfully explain why the previously reported combinations of corpora were effective. Furthermore, it can find effective corpus combinations.
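
The two steps described in the abstract (pairwise similarities between word frequency distributions, then an edge-level analysis of the resulting weighted graph) can be sketched roughly as follows. The abstract names neither the similarity measures nor the edge-importance measure, so this sketch substitutes cosine similarity over relative word frequencies and membership in a maximum spanning tree as stand-ins. The toy corpora and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpora: hypothetical data, used only to make the sketch runnable.
corpora = {
    "general": "the cat sat on the mat and the dog sat on the rug",
    "news": "the market fell and the government sat on the report",
    "medical": "the patient received the dose and the nurse recorded the dose",
}

def relative_frequencies(text):
    """Map each whitespace-separated token to its relative frequency."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency dictionaries."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def similarity_graph(corpora):
    """Weighted complete graph: one edge per corpus pair (assumed measure: cosine)."""
    freqs = {name: relative_frequencies(text) for name, text in corpora.items()}
    return {
        (a, b): cosine_similarity(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }

def maximum_spanning_tree(edges):
    """Kruskal's algorithm on descending weights, with a small union-find.

    Assumption: tree membership is used here as a simple proxy for the
    importance of an edge within the whole graph; the paper may use a
    different measure.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (a, b), _weight in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree

edges = similarity_graph(corpora)
tree = maximum_spanning_tree(edges)
for (a, b), weight in sorted(edges.items()):
    role = "in spanning tree" if (a, b) in tree else "not in spanning tree"
    print(f"{a} - {b}: similarity={weight:.3f} ({role})")
```

A spanning tree is only one way to ask which corpus pairs matter to the overall structure. Other graph statistics, such as edge betweenness, could be substituted without changing the first step.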

Bibliographic details

Source: IEEE International Conference on Machine Learning and Applications | 2019 | 1 v. | 5 pages
Author: Yo Ehara
Original format: PDF
Chinese Library Classification: Computer software
Keywords: Encyclopedias; Task analysis; Vocabulary; Electronic publishing; Internet; Natural language processing

Similar literature

1. Content analysis: Frequency distribution of words [J]. Dicle Mehmet F., Dicle Betul. The Stata Journal. 2018, No. 2.
2. Finite word-length effects in implementation of distributions for time-frequency signal analysis [J]. Ivanovic V., Stankovic L. IEEE Transactions on Signal Processing. 1998, No. 7.
3. Surviving Blind Decomposition: A Distributional Analysis of the Time-Course of Complex Word Recognition [J]. Schmidtke Daniel, Matsuki Kazunaga, Kuperman Victor. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2017, No. 11.
4. Graph-Based Analysis of Similarities between Word Frequency Distributions of Various Corpora for Complex Word Identification [C]. Yo Ehara. IEEE International Conference on Machine Learning and Applications. 2019.
5. Automatic acquisition of lexical semantic knowledge from large corpora: The identification of semantically related words, markedness, polarity, and antonymy [D]. Hatzivassiloglou, Vasileios. 1998.
6. Surviving blind decomposition: a distributional analysis of the time-course of complex word recognition [O]. Daniel Schmidtke, Kazunaga Matsuki, Victor Kuperman.
7. Distributional Similarity of Words with Different Frequencies [O]. Wartena Christian. 2013.
