Conference: IAPR International Workshop on Document Analysis Systems

QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries


Abstract

Commercial optical character recognition (OCR) software now achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of the Arabic heritage collections in libraries is usually more challenging, consisting for example of typewritten documents, early prints, and historical manuscripts. In this paper, we present QATIP, our end-user-oriented system for OCR on such documents. Recognition is based on the Kaldi toolkit and sophisticated text image normalization. The paper makes two main contributions. First, we describe the QATIP interface for libraries, which consists of a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we propose novel approaches to language modelling and ligature modelling for continuous Arabic OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements: for example, a 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
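The character error rate (CER) quoted in the abstract is conventionally defined as the edit distance between the recognized text and the reference transcription, divided by the length of the reference. The abstract does not state how the authors computed or normalized it, so the following is only a minimal sketch of the standard definition. The function names are hypothetical, and the code compares raw Unicode code points, which is an assumption (Arabic diacritics, for instance, would each count as a separate character).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i characters of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters.
    Assumption: characters are Unicode code points, with no normalization."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example (not from the paper): one of four reference characters is missing.
    print(character_error_rate("كتاب", "كتب"))
    # The two CER figures reported in the abstract, in percent.
    print(relative_reduction(51.8, 12.6))
```

Applied to the two figures in the abstract, `relative_reduction(51.8, 12.6)` works out to roughly 0.76, that is, QATIP's reported CER is about three quarters lower than the Tesseract baseline in that experiment.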
