Recognition and Classification of Figures in PDF Documents


Abstract

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses on graphics recognition of figures in vector-based PDF documents. The first stage consists of extracting the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage consists of discovering simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) was able to achieve 100% classification accuracy in hold-out-one training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
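The second and third stages described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Segment` type, the two grapheme types `corner` and `parallel_pair`, and the tolerance values are hypothetical, since the abstract does not list the paper's 16 grapheme types. The classifier is a plain 1-nearest-neighbor rule standing in for LogitBoost, and it is evaluated by leaving one figure out at a time (the abstract's "hold-out-one" procedure).

```python
import math
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """Hypothetical line primitive, as might come out of the extraction stage."""
    x1: float
    y1: float
    x2: float
    y2: float

    def endpoints(self):
        return [(self.x1, self.y1), (self.x2, self.y2)]

    def direction(self):
        return (self.x2 - self.x1, self.y2 - self.y1)


def share_endpoint(a, b, tol=1e-6):
    """True if any endpoint of a coincides with an endpoint of b."""
    return any(math.dist(p, q) <= tol
               for p in a.endpoints() for q in b.endpoints())


def angle_kind(a, b, tol=1e-6):
    """Classify the angle between two segments from normalized dot/cross products."""
    ax, ay = a.direction()
    bx, by = b.direction()
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        return "other"  # degenerate zero-length segment
    dot = (ax * bx + ay * by) / norm
    cross = (ax * by - ay * bx) / norm
    if abs(dot) <= tol:
        return "perpendicular"
    if abs(cross) <= tol:
        return "parallel"
    return "other"


# Assumed grapheme types; the paper's actual set is not given in the abstract.
GRAPHEME_TYPES = ("corner", "parallel_pair")


def grapheme_counts(segments):
    """Count pairs of primitives that satisfy each geometric constraint."""
    counts = Counter()
    for a, b in combinations(segments, 2):
        kind = angle_kind(a, b)
        if kind == "perpendicular" and share_endpoint(a, b):
            counts["corner"] += 1
        elif kind == "parallel" and not share_endpoint(a, b):
            counts["parallel_pair"] += 1
    return counts


def feature_vector(segments):
    """Grapheme statistics used as attributes for one figure."""
    counts = grapheme_counts(segments)
    return [counts[t] for t in GRAPHEME_TYPES]


def leave_one_out_accuracy(vectors, labels):
    """Classify each figure with 1-NN trained on all the other figures."""
    correct = 0
    for i, v in enumerate(vectors):
        others = [(math.dist(v, w), labels[j])
                  for j, w in enumerate(vectors) if j != i]
        predicted = min(others)[1]
        correct += predicted == labels[i]
    return correct / len(vectors)


# A unit square: four corners and two pairs of opposite, parallel sides.
rectangle = [Segment(0, 0, 1, 0), Segment(1, 0, 1, 1),
             Segment(1, 1, 0, 1), Segment(0, 1, 0, 0)]
print(feature_vector(rectangle))  # [4, 2]
```

Counting grapheme occurrences per figure gives every figure a fixed-length attribute vector regardless of how many primitives it contains, which is what lets a standard learner be applied to it.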


Bibliographic details

Source
International Workshop on Graphics Recognition (GREC 2005); August 25-26, 2005; Hong Kong (CN) | 2005 | pp. 231-242 | 12 pages
Conference location: Hong Kong (CN)
Authors
Mingyan Shao; Robert P. Futrelle
Author affiliation

Northeastern University, Boston, MA 02115, USA;

Original format: PDF
Language: English (eng)
Chinese Library Classification: Information processing
Keywords
graphics recognition; PDF; graphemes; vector graphics; machine learning; boosting;


Similar literature

1. Data-Driven Recognition and Extraction of PDF Document Elements [J]. Matthias Hansen, André Pomp, Kemal Erki. Technologies. 2019, No. 3
2. Document recognition by real-time classifications of character images and reduction of correction labor of recognition results [J]. Eriko Ando, Masakazu Suzuki. 電子情報通信学会技術研究報告. パターン認識·メディア理解. Pattern Recognition and Media Understanding. 2001, No. 712
3. Document recognition by real-time classifications of character images and reduction of correction labor of recognition results [J]. Eriko Ando, Masakazu Suzuki. 電子情報通信学会技術研究報告. 言語理解とコミュニケーション. Natural Language Understanding and Models of Communication. 2001, No. 711
4. Recognition and Classification of Figures in PDF Documents [C]. Mingyan Shao, Robert P. Futrelle. International Workshop on Graphics Recognition. 2006
5. AUTOMATIC DOCUMENT CLASSIFICATION THEORY--A PATTERN RECOGNITION APPROACH [D]. TAYLOR, RAWDON MONTGOMERIE.
6. Embedding and Publishing Interactive 3-Dimensional Scientific Figures in Portable Document Format (PDF) Files [O]. David G. Barnes, Michail Vidiassov, Bernhard Ruthensteiner.
7. Recognition and Classification of Figures in PDF Documents [O]. Mingyan Shao, Robert P. Futrelle. 2006
