Venue: Pacific Asia Conference on Language, Information and Computation

Dataset Construction Method for Word Reading Disambiguation

Abstract

The scarcity of large corpora of reading-disambiguated words is a major limitation both in linguistic analysis and in starting a statistical approach to word reading disambiguation. Because the readings of words, like their meanings, are usually not written in documents, human annotation is necessary but expensive. In this study, a method is proposed to construct a reading-disambiguated dataset for word reading disambiguation. The method constructs a dataset of sentences in which words with ambiguous readings (pronunciations), called heteronyms, are tagged with the correct reading. In this method, a word with a unique reading is linked to a reading of a heteronym, and this unambiguous word is used as a query word to collect sentences that contain it. The query word in each collected sentence is then replaced by the original ambiguous word, and the reading that corresponds to the query word is tagged as the pronunciation of the heteronym. Experiments confirmed that the method collects data effectively and that the collected data are numerically balanced across all readings of the heteronym.
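
The collect, replace, and tag steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The English heteronym "tear", the reading labels "TEER" and "TAIR", the query words "teardrop" and "rip", the toy corpus, and the function name build_dataset are all hypothetical choices made for illustration; the abstract does not say how query words are chosen or which corpus is searched.

```python
import re
from collections import Counter

# Hypothetical mapping: heteronym -> {reading label: unambiguous query word}.
# These pairs are assumptions for illustration, not taken from the paper.
QUERY_WORDS = {
    "tear": {"TEER": "teardrop", "TAIR": "rip"},
}

# Toy stand-in for the text collection that would be searched.
CORPUS = [
    "A teardrop rolled down her cheek.",
    "There is a rip in the fabric.",
    "He wiped a teardrop from his eye.",
    "The rip in the sail grew wider.",
    "Nothing relevant here.",
]


def build_dataset(heteronym, query_words, corpus):
    """Collect sentences by query word, swap in the heteronym, tag the reading."""
    dataset = []
    for reading, query in query_words[heteronym].items():
        # Match the query word as a whole word only.
        pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
        for sentence in corpus:
            if pattern.search(sentence):
                dataset.append({
                    # Replace the query word with the ambiguous word.
                    "sentence": pattern.sub(heteronym, sentence),
                    "heteronym": heteronym,
                    # The reading is inherited from the query word.
                    "reading": reading,
                })
    return dataset


dataset = build_dataset("tear", QUERY_WORDS, CORPUS)
# Count examples per reading to inspect how balanced the result is.
counts = Counter(example["reading"] for example in dataset)
```

In this toy run every reading receives the same number of sentences only because the corpus was written that way. With a real corpus, the balance would depend on how often each query word occurs.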