Pattern Recognition: The Journal of the Pattern Recognition Society

JAPANESE LANGUAGE MODEL BASED ON BIGRAMS AND ITS APPLICATION TO ON-LINE CHARACTER RECOGNITION


Abstract

This paper deals with a postprocessing method based on the n-gram approach for Japanese character recognition. In Japanese, a small number of phonetic characters (Kana) and thousands of Kanji characters, which are ideographs, are used for describing ordinary sentences. In other words, Japanese sentences not only have a large character set, but also include characters with different entropies. It is therefore difficult to apply conventional n-gram-based methodologies to postprocessing in Japanese character recognition. In order to resolve these two difficulties, we propose a method that uses parts of speech in the following ways. One is to reduce the number of Kanji characters by clustering them according to the parts of speech in which each Kanji character is used. The other is to increase the entropy of a Kana character by classifying it into more detailed subcategories with part-of-speech attributes. We applied a bigram approach based on these two techniques to a Japanese language model. Experiments yielded the following two results: (1) our language model resolved the imbalance between Kana and Kanji characters and reduced the perplexity of Japanese to less than 100 when Japanese newspaper texts (containing a total of approximately three million characters) were used to train our model, and (2) postprocessing with the model for on-line character recognition rectified about half of all substitution errors when the correct characters were among the candidates.
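To make the class-based bigram idea concrete, the following minimal Python sketch shows one way such a model could be estimated and its perplexity measured. It is not the authors' implementation: the character-to-class mapping `char_class`, the add-alpha smoothing, and the factorization P(char | previous char) ≈ P(class | previous class) × P(char | class) are illustrative assumptions standing in for the paper's part-of-speech clustering of Kanji and part-of-speech subcategorization of Kana.

```python
import math
from collections import defaultdict

def train_class_bigram(corpus, char_class):
    """Count class-to-class transitions and class-to-character emissions.

    corpus: iterable of sentences (strings).
    char_class: hypothetical function mapping a character to its class,
    standing in for POS-based Kanji clusters and Kana subcategories.
    """
    bigram = defaultdict(lambda: defaultdict(int))  # class -> next-class counts
    emit = defaultdict(lambda: defaultdict(int))    # class -> character counts
    for sentence in corpus:
        classes = [char_class(ch) for ch in sentence]
        for ch, cls in zip(sentence, classes):
            emit[cls][ch] += 1
        for prev, cur in zip(classes, classes[1:]):
            bigram[prev][cur] += 1
    return bigram, emit

def prob(prev_ch, ch, bigram, emit, char_class, alpha=0.5):
    """P(ch | prev_ch) ~= P(class | prev class) * P(ch | class), add-alpha smoothed."""
    c_prev, c = char_class(prev_ch), char_class(ch)
    n_classes = max(len(emit), 1)
    vocab = max(len(emit[c]), 1)
    p_trans = (bigram[c_prev][c] + alpha) / (sum(bigram[c_prev].values()) + alpha * n_classes)
    p_emit = (emit[c][ch] + alpha) / (sum(emit[c].values()) + alpha * vocab)
    return p_trans * p_emit

def perplexity(sentences, bigram, emit, char_class):
    """Per-character perplexity: 2 ** (negative mean log2 probability)."""
    log_sum, n = 0.0, 0
    for s in sentences:
        for prev, cur in zip(s, s[1:]):
            log_sum += math.log2(prob(prev, cur, bigram, emit, char_class))
            n += 1
    return 2 ** (-log_sum / n)
```

Training such a model on newspaper text and calling `perplexity` on held-out sentences corresponds to result (1), where the paper reports a perplexity below 100 after learning from roughly three million characters.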
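Result (2) concerns the postprocessing itself: for each character position the on-line recognizer returns a ranked list of candidate characters, and the language model selects the candidate sequence with the highest bigram score. The sketch below, a plain Viterbi-style dynamic program, is one reasonable realization of that step rather than the paper's exact algorithm; `score_fn` is a hypothetical hook that could wrap `math.log2(prob(...))` from the previous sketch.

```python
def rescore(candidate_lists, score_fn):
    """Choose the best character sequence from per-position candidate lists.

    candidate_lists: list of lists; candidate_lists[i] holds the recognizer's
    candidates for position i. score_fn(prev, cur) returns an additive
    (log-probability-like) bigram score.
    """
    # best[ch] = (score of the best path ending in ch, that path)
    best = {ch: (0.0, [ch]) for ch in candidate_lists[0]}
    for candidates in candidate_lists[1:]:
        nxt = {}
        for cur in candidates:
            score, path = max(
                ((s + score_fn(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

When the correct character appears somewhere in a position's candidate list but not in first place, a search like this can promote it whenever the corrected path scores higher under the bigram model, which is how roughly half of the substitution errors were rectified in the paper's experiments.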
