Conference: International Conference on Human-Computer Interaction

A Study on Document Retrieval System Based on Visualization to Manage OCR Documents



Abstract

In recent years, the spread of scanners has rapidly advanced the digitization of paper documents. However, tagging or sorting a huge number of scanned documents one by one is costly in time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies and retrieves the documents, would therefore be useful. LDA, one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy worsens. This paper considers the case of applying LDA to Japanese OCR documents and studies a method to improve the performance of topic inference. It defines the reliability of recognized words using N-grams and proposes a weighted LDA method based on that reliability. A preliminary experiment on detecting falsely recognized words confirms the adequacy of the reliability measure, and an experiment classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
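
The abstract does not give the reliability formula or the weighting scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a character-bigram model trained on a trusted word list, scores each recognized word by the geometric mean of its smoothed bigram scores, and replaces raw word counts with reliability-weighted counts that a topic model accepting fractional counts could consume. The function names, the add-alpha smoothing, and the use of character (rather than word) N-grams are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(clean_words):
    """Count character bigrams in a trusted word list (hypothetical helper)."""
    bigrams = Counter()   # (previous char, next char) -> count
    contexts = Counter()  # previous char -> count
    for word in clean_words:
        padded = "^" + word + "$"  # explicit start and end markers
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts


def reliability(word, bigrams, contexts, alpha=1.0):
    """Geometric mean of add-alpha smoothed bigram scores, in (0, 1].

    This scoring rule is an assumption, not the paper's definition.
    """
    # Number of distinct successor symbols seen, plus one slot for unseen ones.
    vocab = len({b for (_, b) in bigrams}) + 1
    padded = "^" + word + "$"
    log_sum = 0.0
    steps = 0
    for a, b in zip(padded, padded[1:]):
        score = (bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab)
        log_sum += math.log(score)
        steps += 1
    return math.exp(log_sum / steps)


def weighted_counts(doc_words, bigrams, contexts):
    """Replace each occurrence's count of 1 with the word's reliability."""
    weights = Counter()
    for word in doc_words:
        weights[word] += reliability(word, bigrams, contexts)
    return dict(weights)


if __name__ == "__main__":
    model = train_bigram_model(["topic", "model", "document", "retrieval"])
    doc = ["topic", "t0p1c", "model", "topic"]
    for word, weight in weighted_counts(doc, *model).items():
        print(word, round(weight, 3))
```

In this sketch a misrecognized token such as `t0p1c` contains bigrams never seen in the trusted list, so it receives a lower weight than `topic` and contributes less to topic inference. For Japanese text the trusted list and the choice of N would need to suit the script and the tokenizer in use.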
