Published in: IEEE International Conference on Machine Learning and Applications

Graph-Based Analysis of Similarities between Word Frequency Distributions of Various Corpora for Complex Word Identification


Abstract

Complex word identification (CWI) is a fundamental task in educational NLP and applied linguistics that involves identifying complex words in a text for applications such as text simplification. Recent studies have independently reported that word-frequency features from some uncommon corpora improve CWI accuracy when used in combination with those from a general corpus; this suggests that they can serve as adjustments to a general corpus. However, although previous studies have analyzed similarity values between each pair of corpora, the significance of each similarity within the entire set of corpora remains unclear. This complicates the analysis of which combinations of general and uncommon corpora improve CWI accuracy, so the search for effective types of corpora would have to be exhaustive. To contribute to a better understanding and a non-exhaustive search, this paper proposes a novel graph-based analysis method. We first calculate various similarities among the word frequency distributions of various corpora in an unsupervised manner. Subsequently, we regard each similarity as a weighted graph and analyze the importance of a pair of corpora, or an edge, within the entire graph structure. Our experiments show that the analysis method can successfully explain why the previously reported combinations of corpora were effective; furthermore, it can find effective corpus combinations.
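The two steps described in the abstract (pairwise similarities between word frequency distributions, then edge analysis on the resulting weighted graph) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: the abstract does not name its similarity measures or its edge-importance criterion, so cosine similarity and membership in a maximum spanning tree are stand-ins chosen here, and the corpora and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def relative_frequencies(tokens):
    """Map each word to its relative frequency in one corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors (dicts).

    Assumption: cosine is a stand-in; the abstract only says "various
    similarities" without naming them.
    """
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def similarity_graph(corpora):
    """Build a complete weighted graph with one edge per pair of corpora."""
    freqs = {name: relative_frequencies(toks) for name, toks in corpora.items()}
    return {
        (a, b): cosine_similarity(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


def maximum_spanning_tree(edges):
    """Kruskal's algorithm over descending weights; returns the kept edges.

    Assumption: treating tree membership as "edge importance" is an
    illustrative choice, not the criterion used in the paper.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for (a, b), _w in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two separate components
            parent[root_a] = root_b
            kept.append((a, b))
    return kept


# Toy, made-up corpora purely for illustration.
corpora = {
    "general": "the cat sat on the mat and the dog sat too".split(),
    "news": "the minister sat on the committee and the vote passed".split(),
    "learner": "cat dog mat cat dog sat sat".split(),
}
edges = similarity_graph(corpora)
backbone = maximum_spanning_tree(edges)
```

Here `edges` holds one similarity per corpus pair, and `backbone` lists the pairs that survive as the strongest connections spanning all corpora. A different similarity measure or graph statistic could be swapped in without changing the overall structure.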
