Conference: International Symposium on Low Power Electronics and Design

An Automatic Method to Extract Data from an Electronic Contract Composed of a Number of Documents in PDF Format

Abstract

An electronic contract can encompass a large number of collateral contract documents in PDF format. These documents are of different contract document types and have been converted from different original formats, which makes data extraction, and thus data mining, very difficult for this kind of electronic contract. In this paper, we present a novel method to automatically extract contract data from such contracts. Our automatic electronic contract data extraction system comprises an administrator module, a PDF parser, a pattern recognition engine and a contract data extraction engine. The administrator module provides templates for entering the document patterns and a list of contract data tags for each contract document type. It also constructs the pattern matrices and stores them in a database. The PDF parser converts the contract PDF document into a contract text document, inserting formatting bookmarks that mark, for example, a new page, paragraph or line. The pattern recognition engine determines the list of contract document types present in the electronic contract by comparing and matching the patterns of all known contract document types against the pattern of the contract text document. For each contract document type on that list, the contract data extraction engine retrieves the corresponding list of contract data tags and extracts the contract data accordingly. Our automatic electronic contract data extraction system has been found to be very accurate, efficient and useful in extracting contract data for data mining.
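The pipeline described in the abstract (bookmarked text, type recognition, tag-driven extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The bookmark tokens, the `DOC_TYPES` table, the document types, the regular expressions and the 0.6 threshold are all hypothetical. The sketch also swaps in a simple phrase-overlap score where the paper uses pattern matrices, whose construction the abstract does not describe, and it assumes that each collateral document starts on a new page.

```python
import re

# Hypothetical bookmark tokens; the abstract only says the parser marks
# a new page, paragraph or line.
PAGE, PARA, LINE = "<<PAGE>>", "<<PARA>>", "<<LINE>>"

# Hypothetical administrator data: for each document type, a list of marker
# phrases (standing in for a pattern matrix) and its contract data tags,
# each mapped to an illustrative regular expression.
DOC_TYPES = {
    "loan_agreement": {
        "pattern": ["LOAN AGREEMENT", "Borrower", "Principal Amount"],
        "tags": {
            "borrower": r"Borrower:\s*(.+)",
            "principal": r"Principal Amount:\s*([\d,\.]+)",
        },
    },
    "guarantee": {
        "pattern": ["DEED OF GUARANTEE", "Guarantor"],
        "tags": {"guarantor": r"Guarantor:\s*(.+)"},
    },
}


def split_pages(text):
    """Split bookmarked text on page bookmarks (assumes one document per page)."""
    return [page for page in text.split(PAGE) if page.strip()]


def recognize(page, threshold=0.6):
    """Return the best-matching document type, or None if below the threshold."""
    best_type, best_score = None, 0.0
    for name, cfg in DOC_TYPES.items():
        hits = sum(1 for phrase in cfg["pattern"] if phrase in page)
        score = hits / len(cfg["pattern"])  # fraction of marker phrases found
        if score > best_score:
            best_type, best_score = name, score
    return best_type if best_score >= threshold else None


def extract(page, doc_type):
    """Apply each tag's regex to the bookmarked lines; keep the first match."""
    splitter = f"{re.escape(PARA)}|{re.escape(LINE)}"
    lines = [ln.strip() for ln in re.split(splitter, page)]
    data = {}
    for tag, regex in DOC_TYPES[doc_type]["tags"].items():
        for ln in lines:
            match = re.match(regex, ln)
            if match:
                data[tag] = match.group(1).strip()
                break
    return data


def process(text):
    """Recognize every page, then extract the tags for each recognized type."""
    results = []
    for page in split_pages(text):
        doc_type = recognize(page)
        if doc_type is not None:
            results.append((doc_type, extract(page, doc_type)))
    return results


# Made-up bookmarked text, as a parser might emit for a two-document contract.
sample = (
    "<<PAGE>>LOAN AGREEMENT<<LINE>>Borrower: Acme Ltd"
    "<<LINE>>Principal Amount: 1,000,000"
    "<<PAGE>>DEED OF GUARANTEE<<LINE>>Guarantor: Jane Doe"
)
```

Calling `process(sample)` yields one `(document type, extracted data)` pair per recognized page. Keeping the type definitions in a data table rather than in code mirrors the administrator module's role: supporting a new contract document type means adding an entry, not changing the engines.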
