
Text Extraction from Bills and Invoices

Abstract

This research seeks a methodology for extracting data from everyday printed bills and invoices. The extracted data can later be put to wide use, for example in machine learning or statistical analysis. The work focuses on extracting the final bill amount, itinerary, date, and similar fields, because bills and invoices encapsulate ample information about a user's purchases, likes, and dislikes. Optical Character Recognition (OCR) is a technology that recognizes printed or handwritten alphanumeric characters in images. First, OpenCV is used to detect the bill or invoice in the image and to filter out unnecessary noise. The intermediate image is then passed to the Tesseract OCR engine for further processing. Tesseract applies text segmentation to extract written text in various fonts and languages. The methodology proved highly accurate when tested on a variety of input images of bills and invoices.
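
The pipeline described in the abstract (OpenCV clean-up, Tesseract OCR, then field extraction) might look roughly like the sketch below. This is not the authors' implementation. The abstract does not say which OpenCV operations were used, so the grayscale, blur, and Otsu-threshold steps are assumptions, as are the `pytesseract` wrapper, the function names, the keyword list, and the day-first date format.

```python
import re
from datetime import date
from typing import Optional


def ocr_bill(image_path: str) -> str:
    """Denoise a bill photo with OpenCV, then run Tesseract on it."""
    # Third-party imports are local so the parsing helpers below can be
    # used without OpenCV or pytesseract installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # assumed noise filter
    # Otsu's method picks the binarisation threshold automatically.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return pytesseract.image_to_string(binary)


# Hypothetical keyword list; real bills use many more labels.
AMOUNT_RE = re.compile(
    r"\b(?:grand\s+total|total|amount\s+due)\b[^\d\n]*(\d[\d,]*\.\d{2})",
    re.IGNORECASE,
)
# Assumes day-first numeric dates such as 14/03/2021 or 14-03-2021.
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def extract_total(text: str) -> Optional[float]:
    """Return the last labelled total in the OCR text, if any."""
    matches = AMOUNT_RE.findall(text)
    if not matches:
        return None
    # The final amount usually appears after subtotals and tax lines.
    return float(matches[-1].replace(",", ""))


def extract_date(text: str) -> Optional[date]:
    """Return the first valid day-first date in the OCR text, if any."""
    for day, month, year in DATE_RE.findall(text):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            continue  # e.g. 31/02/2021; try the next candidate
    return None
```

The parsing helpers work on any string, so they can be tried on hand-typed receipt text before wiring in `ocr_bill`. Real OCR output is noisier (misread digits, split lines), so a production extractor would need more tolerant patterns than these.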
