Conference: International Conference on Very Large Data Bases

Probabilistic Management of OCR Data using an RDBMS


Abstract

The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then store the resulting ASCII text in a relational database. The OCR problem is challenging, so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. These approaches discard the uncertainty only when the results of OCR are stored in the database. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book grows from 400 kB as ASCII to 2 GB). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that lie between the two extremes of plain ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties and describe how we integrate our scheme with standard RDBMS text indexing.
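The recall-versus-performance trade-off described in the abstract can be illustrated with a toy sketch. This is not the Staccato scheme itself, whose details the abstract does not give. It assumes a hypothetical OCR model in which each character position carries independent alternatives, and it approximates that model by keeping only the k most likely strings. The names `ocr_model`, `all_strings`, and `match_probability`, and all probabilities, are invented for illustration.

```python
from itertools import product
from math import prod

# Hypothetical OCR output for a scan of the word "Ford": each position
# lists candidate characters with made-up, independent probabilities.
ocr_model = [
    {"F": 0.6, "T": 0.4},
    {"0": 0.55, "o": 0.45},
    {"r": 0.9, "n": 0.1},
    {"d": 1.0},
]

def all_strings(model):
    """Enumerate every string the toy model can emit, most probable first."""
    results = []
    for combo in product(*(position.items() for position in model)):
        text = "".join(ch for ch, _ in combo)
        probability = prod(p for _, p in combo)
        results.append((text, probability))
    return sorted(results, key=lambda pair: -pair[1])

def match_probability(model, query, k=None):
    """Sum the probability of retained strings that contain `query`.

    k=1 mimics storing only the single best ASCII string; k=None keeps
    the full model; values in between keep the k most likely strings.
    """
    strings = all_strings(model)
    if k is not None:
        strings = strings[:k]
    return sum(p for text, p in strings if query in text)

if __name__ == "__main__":
    for k in (1, 2, None):
        print(k, match_probability(ocr_model, "ord", k=k))
```

With k=1 the query "ord" matches nothing, because the single most likely string is "F0rd" (with a zero). Keeping more alternatives recovers the match at the cost of storing and scanning more data. Full enumeration is exponential in the number of positions, so this sketch only works for tiny inputs.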