Comparative Study of PDF Generation Methods: Measuring Loss of Fidelity When Converting Arabic and Persian MS Word Files to PDF

Abstract

Converting files to Portable Document Format (PDF) is popular due to the format's many advantages. For example, PDF allows an author to control or preserve the rendering of a digital document, distribute it to other systems, and ensure that it displays in a viewer as intended. From the perspective of Human Language Technology (HLT), however, PDFs are problematic. PDF is a display-oriented digital document format; the point of PDF is to preserve the appearance of a document, not to preserve the original electronic text. We observed errors in PDF-extracted text indicating that the PDF generator, the extractor, or both mishandled the document structure, character data, and/or entire textual objects. We also learned that other HLT researchers reported data loss when extracting electronic text from PDFs. This motivated further study of digital document data exchange using PDFs. MITRE conducted an exploratory study of data exchange using PDF in order to investigate the data loss phenomenon. We limited our study to Middle Eastern electronic text, specifically Arabic and Persian. The study included a test for scoring PDF generation methods: (a) using a common, best-practice setup to generate PDFs and extract text, and (b) using character accuracy to quantify the quality of PDF-extracted text. We ranked 8 methods according to the resulting accuracy scores. The 8 methods map to 3 core PDF generation classes. At best, the Microsoft Word class resulted in 42% Overall Accuracy. Best scores for the PDFMaker and Acrobat Distiller/PScript5.dll classes were 95% and 96%, respectively. This paper explains our tests and discusses the results, including evidence that using PDF for data exchange of typical Arabic and Persian documents results in a loss of important electronic text content. This loss confuses human language technologies such as search engines, machine translati.
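
The abstract does not give the formula behind its character accuracy scores. As a rough illustration only, the sketch below assumes a common definition: the number of reference characters minus the edit distance between the reference text and the extracted text, divided by the number of reference characters. The function names `edit_distance` and `character_accuracy` are hypothetical and are not taken from the report.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance, counted over Unicode code points."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, extracted: str) -> float:
    """Assumed definition: (n - errors) / n, with n = len(reference).

    The result can be negative when the extracted text contains many
    spurious characters; this sketch does not clamp it.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, extracted)
    return (len(reference) - errors) / len(reference)


# Illustrative data, not from the study: the Arabic word "كتاب" (4 letters)
# extracted with one letter dropped.
reference_text = "\u0643\u062A\u0627\u0628"
extracted_text = "\u0643\u062A\u0628"
score = character_accuracy(reference_text, extracted_text)
```

Counting code points rather than bytes matters for Arabic and Persian, where each letter occupies more than one byte in UTF-8. A fuller comparison would probably also need to settle on a Unicode normalization form and a character order (logical versus visual) before scoring, which this sketch leaves out.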
