Pattern Recognition: The Journal of the Pattern Recognition Society

JAPANESE LANGUAGE MODEL BASED ON BIGRAMS AND ITS APPLICATION TO ON-LINE CHARACTER RECOGNITION


Abstract

This paper deals with a postprocessing method based on the n-gram approach for Japanese character recognition. In Japanese, a small number of phonetic characters (Kana) and thousands of Kanji characters, which are ideographs, are used for describing ordinary sentences. In other words, Japanese sentences not only have a large character set, but also include characters with different entropies. It is therefore difficult to apply conventional n-gram-based methodologies to postprocessing in Japanese character recognition. In order to resolve these two difficulties, we propose a method that uses parts of speech in the following ways. One is to reduce the number of Kanji characters by clustering them according to the parts of speech in which each Kanji character is used. The other is to increase the entropy of a Kana character by classifying it into more detailed subcategories with part-of-speech attributes. We applied a bigram approach based on these two techniques to a Japanese language model. Experiments yielded the following two results: (1) our language model resolved the imbalance between Kana and Kanji characters and reduced the perplexity of Japanese to less than 100 when Japanese newspaper texts (containing a total of approximately three million characters) were used to train our model, and (2) postprocessing with the model for on-line character recognition rectified about half of all substitution errors when the correct characters were among the candidates.
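To make the class-based bigram idea concrete, the following minimal Python sketch shows one way such a model could be estimated and its perplexity measured. It is not the authors' implementation: the character-to-class mapping `char_class`, the add-alpha smoothing, and the factorization P(char | previous char) ≈ P(class | previous class) × P(char | class) are illustrative assumptions standing in for the paper's part-of-speech clustering of Kanji and part-of-speech subcategorization of Kana.

```python
import math
from collections import defaultdict

def train_class_bigram(corpus, char_class):
    """Count class-to-class transitions and class-to-character emissions.

    corpus: iterable of sentences (strings).
    char_class: hypothetical function mapping a character to its class,
    standing in for POS-based Kanji clusters and Kana subcategories.
    """
    bigram = defaultdict(lambda: defaultdict(int))  # class -> next-class counts
    emit = defaultdict(lambda: defaultdict(int))    # class -> character counts
    for sentence in corpus:
        classes = [char_class(ch) for ch in sentence]
        for ch, cls in zip(sentence, classes):
            emit[cls][ch] += 1
        for prev, cur in zip(classes, classes[1:]):
            bigram[prev][cur] += 1
    return bigram, emit

def prob(prev_ch, ch, bigram, emit, char_class, alpha=0.5):
    """P(ch | prev_ch) ~= P(class | prev class) * P(ch | class), add-alpha smoothed."""
    c_prev, c = char_class(prev_ch), char_class(ch)
    n_classes = max(len(emit), 1)
    vocab = max(len(emit[c]), 1)
    p_trans = (bigram[c_prev][c] + alpha) / (sum(bigram[c_prev].values()) + alpha * n_classes)
    p_emit = (emit[c][ch] + alpha) / (sum(emit[c].values()) + alpha * vocab)
    return p_trans * p_emit

def perplexity(sentences, bigram, emit, char_class):
    """Per-character perplexity: 2 ** (negative mean log2 probability)."""
    log_sum, n = 0.0, 0
    for s in sentences:
        for prev, cur in zip(s, s[1:]):
            log_sum += math.log2(prob(prev, cur, bigram, emit, char_class))
            n += 1
    return 2 ** (-log_sum / n)
```

Training such a model on newspaper text and calling `perplexity` on held-out sentences corresponds to result (1), where the paper reports a perplexity below 100 after learning from roughly three million characters.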
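Result (2) concerns the postprocessing itself: for each character position the on-line recognizer returns a ranked list of candidate characters, and the language model selects the candidate sequence with the highest bigram score. The sketch below, a plain Viterbi-style dynamic program, is one reasonable realization of that step rather than the paper's exact algorithm; `score_fn` is a hypothetical hook that could wrap `math.log2(prob(...))` from the previous sketch.

```python
def rescore(candidate_lists, score_fn):
    """Choose the best character sequence from per-position candidate lists.

    candidate_lists: list of lists; candidate_lists[i] holds the recognizer's
    candidates for position i. score_fn(prev, cur) returns an additive
    (log-probability-like) bigram score.
    """
    # best[ch] = (score of the best path ending in ch, that path)
    best = {ch: (0.0, [ch]) for ch in candidate_lists[0]}
    for candidates in candidate_lists[1:]:
        nxt = {}
        for cur in candidates:
            score, path = max(
                ((s + score_fn(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

When the correct character appears somewhere in a position's candidate list but not in first place, a search like this can promote it whenever the corrected path scores higher under the bigram model, which is how roughly half of the substitution errors were rectified in the paper's experiments.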
