Conference: International Conference on Intelligent Systems and Control

Text recognition in bilingual machine printed image documents — Challenges and survey: A review on principal and crucial concerns of text extraction in bilingual printed images

Abstract

In the digital world, accurate text identification and recognition has become a key area of image document analysis and processing. Textual data must be identified and extracted from images that range from simple to complex and that contain monolingual, bilingual, trilingual, or multilingual scripts. This paper focuses on the challenges and complex issues of text recognition in bilingual machine-printed image documents. It identifies the major factors that become bottlenecks to correct and accurate recognition. It then presents a hierarchical structure of three Classification Schemes (CS), A, B, and C, for bilingual printed image documents, where A, B, and C relate to content form, image mining, and language or script determination, respectively. Some weaknesses in how OCR works are also discussed. To analyze the existing algorithms and methods, a survey examines their critical issues and proposed solutions, along with the constraints and errors found during text processing, and so brings out the shortcomings and limitations of the different methods. The specifications and factors reported for each technique are presented as its characteristics and compared to distinguish the methods from one another. It is observed that most of the existing methods fall under the classification schemes CS A-A1, C-C1, and C2, and are designed for script identification on 300 dpi grayscale images using an SVM classifier.