Probabilistic Management of OCR Data using an RDBMS


Abstract

The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then to store the resulting ASCII text in a relational database. The OCR problem is challenging, so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. These approaches discard the uncertainty only when the results of OCR are stored in the database. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book grows from 400 kB as ASCII to 2 GB). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that lie between the two extremes of plain ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties and describe how we integrate our scheme with standard RDBMS text indexing.
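The recall-versus-cost trade-off described in the abstract can be illustrated with a toy sketch. This is not the Staccato scheme itself: it assumes a simplified model in which OCR emits independent per-position alternatives, and the data, probabilities, and function names (`ocr_positions`, `top_k_strings`, `match_probability`) are all hypothetical. Keeping only the single most probable string corresponds to storing plain ASCII; keeping more alternatives costs more storage and query work but can recover matches that the top string misses.

```python
from itertools import product

# Hypothetical OCR output for a scan of the word "Ford": each position holds
# (candidate, probability) alternatives. All numbers are invented.
ocr_positions = [
    [("F", 0.8), ("P", 0.2)],
    [("o", 0.6), ("0", 0.4)],
    [("n", 0.55), ("r", 0.45)],
    [("d", 0.9), ("cl", 0.1)],
]

def top_k_strings(positions, k):
    """Return the k most probable strings, assuming independent positions.

    Enumerates every combination, which is exponential in the number of
    positions; acceptable only for a toy example.
    """
    candidates = []
    for combo in product(*positions):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        candidates.append((text, prob))
    candidates.sort(key=lambda pair: -pair[1])
    return candidates[:k]

def match_probability(positions, k, keyword):
    """Total probability of the retained strings that contain the keyword."""
    return sum(p for text, p in top_k_strings(positions, k) if keyword in text)

# k = 1 keeps only the most probable string ("Fond"), so "Ford" is missed.
print(match_probability(ocr_positions, 1, "Ford"))
# k = 2 also keeps "Ford", so the query now finds a match.
print(match_probability(ocr_positions, 2, "Ford"))
```

In this sketch `k` plays the role of the knob: `k = 1` is the cheapest and loses the true answer, while a larger `k` retains more of the OCR model at higher cost.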


Bibliographic details

Source: International Conference on Very Large Data Bases, 2012, 12 pages
Authors: Arun Kumar; Christopher Ré
Original format: PDF
Chinese Library Classification: TP311.13

