Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus


Abstract

Word frequencies play an important role in a variety of NLP applications. Estimating word frequencies for Chinese is challenging because of characteristics of the language, in particular word formation and word segmentation. This paper addresses word frequency estimation when only a Chinese wordlist and an arbitrarily large raw Chinese corpus are available, and no manual annotation is performed on the corpus. Several realistic schemes for approximating word frequencies are presented under the frameworks of STR (using the frequency of a character string as an approximation of word frequency) and MM (maximal matching). Large-scale experiments indicate that the proposed scheme, MinMaxMM, significantly improves the estimation of word frequencies, though its performance is still not fully satisfactory in some cases.
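
To make the two frameworks named in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes the wordlist is a set of strings and the corpus is a single string; the function names (`str_frequency`, `forward_mm`, `backward_mm`, `mm_frequency`) are hypothetical, and the fallback to single characters for out-of-wordlist text is an assumption. The abstract does not define MinMaxMM, so it is not sketched here.

```python
from collections import Counter


def str_frequency(corpus, wordlist):
    """STR: use the raw string frequency of each word as its frequency.

    Every occurrence is counted, including occurrences that overlap
    other words, so this tends to overestimate true word frequency.
    """
    counts = Counter()
    for word in wordlist:
        start = corpus.find(word)
        while start != -1:
            counts[word] += 1
            start = corpus.find(word, start + 1)
    return counts


def forward_mm(text, wordlist, max_len):
    """Forward maximal matching: scan left to right, longest match first."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: fall back to a single character if nothing matches.
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text, wordlist, max_len):
    """Backward maximal matching: scan right to left, longest match first."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            # Assumption: fall back to a single character if nothing matches.
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def mm_frequency(corpus, wordlist, segmenter):
    """MM: segment the corpus automatically, then count wordlist tokens."""
    max_len = max(len(word) for word in wordlist)
    tokens = segmenter(corpus, wordlist, max_len)
    return Counter(token for token in tokens if token in wordlist)
```

On an ambiguous string such as "研究生命起源" with the illustrative wordlist {"研究", "研究生", "生命", "起源"}, forward and backward matching segment differently, so `mm_frequency` gives different counts depending on the direction, while `str_frequency` counts both "研究" and "研究生". This disagreement is the kind of estimation error the abstract refers to.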
