Text Extraction from Bills and Invoices

Abstract

This research seeks a methodology for extracting data from everyday printed bills and invoices. The extracted data can later be used extensively, for example in machine learning or statistical analysis. The work focuses on extracting the final bill amount, itinerary, date, and similar fields from bills and invoices, since these documents encapsulate ample information about users' purchases, likes, and dislikes. Optical Character Recognition (OCR) is a technology that recognizes printed or handwritten alphanumeric characters in images. First, OpenCV is used to detect the bill or invoice in the image and to filter out unnecessary noise. The intermediate image is then passed to the Tesseract OCR engine for further processing. Tesseract applies text segmentation to extract written text in various fonts and languages. The methodology proved highly accurate when tested on a variety of input images of bills and invoices.
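The pipeline outlined in the abstract (OpenCV clean-up, Tesseract OCR, then field extraction) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the preprocessing steps or the parsing rules, so the blur-plus-Otsu binarization, the function names `ocr_image` and `extract_fields`, the regular expressions, and the day-first date format are all assumptions made for the example. It also assumes the `opencv-python` and `pytesseract` packages and a local Tesseract install are available for the OCR step.

```python
import re
from datetime import datetime


def ocr_image(path):
    """Denoise and binarize a bill photo with OpenCV, then run Tesseract."""
    # Imported lazily so the parsing helper below works without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # suppress fine noise
    # Otsu's method picks the binarization threshold automatically.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return pytesseract.image_to_string(binary)


# Hypothetical patterns: a line starting with "Total"-like wording followed
# by an amount with two decimals, and a numeric DD/MM/YYYY or DD-MM-YYYY date.
AMOUNT_RE = re.compile(
    r"(?im)^\s*(?:grand\s+total|total|amount\s+due)\b[^\d\n]*(\d[\d,]*\.\d{2})"
)
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def extract_fields(text):
    """Pull the final amount and the first date out of OCR text."""
    amounts = AMOUNT_RE.findall(text)
    # The last matching line is treated as the final bill amount.
    total = float(amounts[-1].replace(",", "")) if amounts else None

    date = None
    match = DATE_RE.search(text)
    if match:
        day, month, year = map(int, match.groups())
        try:
            date = datetime(year, month, day).date().isoformat()
        except ValueError:
            date = None  # digits matched but do not form a valid date
    return {"total": total, "date": date}
```

In use, `extract_fields(ocr_image("bill.jpg"))` would return a dictionary with `total` and `date` keys, where `bill.jpg` is a placeholder file name. Real receipts vary widely in layout, so the patterns would need tuning for any particular set of bills.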

Bibliographic details

Source
2018 International Conference on Advances in Computing, Communication Control and Networking, 2018, pp. 564-568 (5 pages)
Conference location: Greater Noida (IN)
Authors
Harshit Sidhwa; Sudhanshu Kulshrestha; Sahil Malhotra; Shivani Virmani;
Author affiliations

Department of Computer Science and Engineering, Jaypee Institute of Information Technology, Noida, India;

Department of Computer Science and Engineering, Jaypee Institute of Information Technology, Noida, India;

Department of Computer Science and Engineering, Jaypee Institute of Information Technology, Noida, India;

Department of Computer Science and Engineering, Jaypee Institute of Information Technology, Noida, India;

Original format: PDF
Language: English
Keywords
Image segmentation; Optical character recognition software; Engines; Character recognition; Optical filters; Image edge detection; Optical imaging;

Similar literature

1. Text Extraction and Recognition from the Normal Images using MSER Feature Extraction and Text Segmentation Methods [J]. Nitin Sharma, Nidhi. Indian Journal of Science and Technology, 2017, No. 17.
2. TEXT MINING ALGORITHM DISCOTEX (DISCOVERY FROM TEXT EXTRACTION) WITH INFORMATION EXTRACTION [J]. Dr. T. LALITHA, S. MEENAKSHI. Journal of Theoretical and Applied Information Technology, 2014, No. 2.
3. Analysis of Image Classification for Text Extraction from Bills and Invoices [C]. Yindumathi K M, Shilpa Shashikant Chaudhari, Aparna R. International Conference on Computing, Communication and Networking Technologies, 2020.
4. Knowledge Extraction and Analysis of Medical Text with Particular Emphasis on Medical Guidelines [D]. Hematialam, Hossein. 2021.
5. Layout-aware text extraction from full-text PDF of scientific articles [O]. Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy. 2012.
6. Design and Implementation of Dehong Local Taxation Machine-printed Invoices Billing Management System [O]. 李继明. 2013.
