
Recognition and Classification of Figures in PDF Documents


Abstract

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses on graphics recognition of figures in vector-based PDF documents. The first stage consists of extracting the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage consists of discovering simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) was able to achieve 100% classification accuracy in hold-out-one training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
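The second and third stages described in the abstract can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the two grapheme types (`parallel_pair`, `corner`), the tolerances, and the function names are hypothetical and are not the 16 grapheme types or the Java code from the paper. The classifier is a plain 1-nearest-neighbour rule swapped in to keep the example self-contained; it is not LogitBoost. The evaluation loop shows the hold-out-one idea: each figure is held out once and classified from the remaining figures.

```python
import math
from itertools import combinations

# A segment is ((x1, y1), (x2, y2)). All names below are illustrative only.

def _angle(seg):
    """Undirected orientation of a segment, folded into [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def _angle_diff(a, b):
    """Smallest difference between two undirected orientations, in [0, pi/2]."""
    d = abs(_angle(a) - _angle(b))
    return min(d, math.pi - d)

def _touch(a, b, tol):
    """True if some endpoint of a lies within tol of some endpoint of b."""
    return any(math.dist(p, q) <= tol for p in a for q in b)

def grapheme_counts(segments, tol=1e-6, ang_tol=math.radians(2)):
    """Count pairs of segments that satisfy simple geometric constraints."""
    counts = {"parallel_pair": 0, "corner": 0}
    for a, b in combinations(segments, 2):
        d = _angle_diff(a, b)
        if d <= ang_tol and not _touch(a, b, tol):
            counts["parallel_pair"] += 1      # parallel and not joined
        elif abs(d - math.pi / 2) <= ang_tol and _touch(a, b, tol):
            counts["corner"] += 1             # perpendicular, sharing an endpoint
    return counts

def leave_one_out_accuracy(features, labels):
    """Hold out each sample once; label it by its nearest remaining neighbour."""
    correct = 0
    for i, x in enumerate(features):
        rest = [(math.dist(x, f), lab)
                for j, (f, lab) in enumerate(zip(features, labels)) if j != i]
        predicted = min(rest, key=lambda t: t[0])[1]
        correct += predicted == labels[i]
    return correct / len(features)

# Toy usage: a rectangle drawn as four line segments.
rect = [((0, 0), (4, 0)), ((4, 0), (4, 2)), ((4, 2), (0, 2)), ((0, 2), (0, 0))]
rect_counts = grapheme_counts(rect)

# Toy feature vectors (grapheme counts per figure) with made-up class labels.
toy_features = [[2, 4], [2, 5], [0, 0], [0, 1]]
toy_labels = ["diagram", "diagram", "plot", "plot"]
toy_accuracy = leave_one_out_accuracy(toy_features, toy_labels)
```

In a setup like the one the abstract reports, each figure would contribute one vector of grapheme statistics, and the nearest-neighbour rule would be replaced by the boosting-based learner.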
