Journal: International Journal of Computer Processing of Oriental Languages

Automatic Identification of Oriental and Other Scripts in Image Documents


Abstract

Many organizations produce and receive an increasing number of paper documents. These documents frequently have to be digitized for electronic archiving and for later information retrieval or data mining, which requires scanning and OCR. Since OCR techniques are language dependent, the language of the original document must first be identified automatically. This paper describes two methods of identifying Oriental languages among four language groups: Oriental, Roman, Cyrillic, and Arabic. One method is based on features extracted from the shapes of words and letters, while the other is based on global analysis of text pieces using Gabor filters. Experimental results on hundreds of documents, both clean and noisy, indicate that the proposed classification approaches are promising. The use of linguistic analysis to enhance the results is also discussed.
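To make the second method more concrete, here is a minimal sketch of how texture features might be computed from a text patch with a small bank of Gabor filters. It is not the paper's implementation: the abstract gives no filter parameters, so the kernel size, wavelengths, number of orientations, the `sigma = 0.5 * wavelength` rule, and all function names (`gabor_kernel`, `filter_energy`, `gabor_features`) are assumptions chosen for illustration. The sketch uses only the Python standard library and treats the image as a list of rows of gray values.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Real (cosine) part of a 2-D Gabor filter as a size x size list of lists.

    All parameter choices here are illustrative assumptions, not values
    taken from the paper.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    # Subtract the mean so that uniform regions produce a zero response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_energy(image, kernel):
    """Mean squared filter response over all fully overlapping positions."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            response = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(k)
                for b in range(k)
            )
            total += response * response
            count += 1
    return total / count


def gabor_features(image, wavelengths=(4.0, 8.0), n_orientations=4, size=9):
    """One energy value per (wavelength, orientation) pair."""
    features = []
    for wavelength in wavelengths:
        for n in range(n_orientations):
            theta = n * math.pi / n_orientations
            kernel = gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength)
            features.append(filter_energy(image, kernel))
    return features


# Synthetic patch: vertical stripes with a period of 4 pixels.
stripes = [[1.0 if (j % 4) < 2 else 0.0 for j in range(16)] for _ in range(16)]
stripe_features = gabor_features(stripes, wavelengths=(4.0,), n_orientations=4)
```

On the synthetic striped patch, the filter tuned to horizontal variation (the first entry of `stripe_features`) responds far more strongly than the one tuned to vertical variation (the third entry). In a script identifier, a feature vector of this kind would be passed to a classifier trained on labeled text patches; that step is outside the scope of this sketch.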
