Automatic extraction of catalog data from digital images of historical manuscripts

Abstract

The Cairo Genizah, discovered in the late 19th century, is a collection of handwritten historical documents containing approximately 350,000 fragments of mainly Jewish texts. The fragments are today spread out in more than seventy libraries and private collections worldwide, and there is an ongoing effort to document and catalog all extant fragments. We explore three levels of extraction of catalog data from digital images of the fragments. First, images should be captured in a way that permits standardized automatic processing. Second, the images can be processed to detect elements such as image foreground, regions of written text, and lines of the text, thereby allowing for the automatic assignment of conventional catalog measurements. Third, modern computer-vision tools and statistical inference techniques may be used to identify fragments that might originate from the same original codex. Such matched fragments, commonly referred to as 'joins', were heretofore identified manually by experts, and presumably only a small fraction of existing joins have been discovered to date. Overall, we present what might be the first effort to address all three levels successfully within a large-scale project, detailing the various design choices and describing the techniques and algorithms used for the Cairo Genizah digitization project.

Bibliographic record

  • Source
    Literary & linguistic computing | 2013, Issue 2 | pp. 315-330 | 16 pages
  • Author affiliations

    The Friedberg Genizah Project, Jerusalem, Israel;

    The Friedberg Genizah Project, Jerusalem, Israel;

    The Blavatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Israel;

    The Blavatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Israel;

  • Original format: PDF
  • Language of text: English
