A Study on Document Retrieval System Based on Visualization to Manage OCR Documents

Abstract

The digitization of paper-based documents has advanced rapidly in recent years with the spread of scanners. However, tagging or sorting a huge number of scanned documents one by one is costly in time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies and retrieves the documents, would therefore be useful. LDA (latent Dirichlet allocation), one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy worsens. This paper considers the application of LDA to Japanese OCR documents and studies a method for improving the performance of topic inference. It defines the reliability of recognized words using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability measure is confirmed through a preliminary experiment on detecting falsely recognized words, and an experiment on classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
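
The abstract does not give the paper's actual reliability formula or how the weights enter LDA, so the following is only a minimal sketch of the general idea under stated assumptions: reliability is taken to be the geometric mean of add-one-smoothed character bigram probabilities learned from trusted text, and each recognized token contributes its reliability, rather than 1, to the document's term counts. The function names `train_char_bigrams`, `word_reliability`, and `weighted_term_counts` are hypothetical and not from the paper.

```python
import math
from collections import Counter, defaultdict


def train_char_bigrams(clean_words):
    """Count character bigrams (with boundary markers) from trusted text.

    Assumes clean_words is non-empty; "^" and "$" mark word start and end.
    """
    pair_counts = Counter()
    left_counts = Counter()
    alphabet = set()
    for word in clean_words:
        chars = ["^"] + list(word) + ["$"]
        alphabet.update(chars)
        for a, b in zip(chars, chars[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
    return pair_counts, left_counts, len(alphabet)


def word_reliability(word, model):
    """Assumed reliability: geometric mean of smoothed bigram probabilities."""
    pair_counts, left_counts, vocab_size = model
    chars = ["^"] + list(word) + ["$"]
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        p = (pair_counts[(a, b)] + 1) / (left_counts[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(log_prob / (len(chars) - 1))


def weighted_term_counts(tokens, model):
    """Sum reliabilities per term instead of counting each token as 1."""
    counts = defaultdict(float)
    for token in tokens:
        counts[token] += word_reliability(token, model)
    return dict(counts)


if __name__ == "__main__":
    model = train_char_bigrams(["topic", "topic", "model", "document"])
    # "t0p1c" imitates an OCR misrecognition of "topic".
    print(weighted_term_counts(["topic", "t0p1c", "topic"], model))
```

In this sketch a misrecognized token such as `t0p1c` receives a much smaller weight than a well-formed one, so it contributes less to the document's term vector. Whether such fractional counts can be passed directly to a topic model depends on the LDA implementation used.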

Bibliographic details

Source: International Conference on Human-Computer Interaction, 2013, pp. 740-749 (10 pages)
Authors: Kazuki Tamura; Tomohiro Yoshikawa; Takeshi Furuhashi
Original format: PDF
