Conference: Fourth Workshop on Noisy User-generated Text

Combining Human and Machine Transcriptions on the Zooniverse Platform

Abstract

Transcribing handwritten documents to create fully searchable texts is an essential part of the archival process. Traditional text recognition methods, such as optical character recognition (OCR), do not work on handwritten documents because such documents are often noisy and OCR requires individually segmented letters. Crowdsourcing and improved machine models are two modern methods for transcribing handwritten documents.

Transcription projects on Zooniverse, a platform for crowdsourced research, generally involve three steps: 1) volunteers identify lines of text; 2) volunteers type out the text associated with a marked line; 3) researchers combine the raw transcription data to generate a consensus. This works well, but projects generally require 10-15 volunteer transcriptions per document to ensure accuracy and coverage, which can be time-consuming. Modern machine models for handwritten text recognition use neural networks to transcribe full lines of unsegmented text. These models have high accuracy on standard datasets (Sanchez et al., 2014), but do not generalize well (Messina and Louradour, 2015; Moysset et al., 2014). While modern techniques substantially improve our ability to collect data, humans are limited in speed and computers are limited in accuracy. Therefore, by combining human and machine classifiers we obtain the most efficient transcription system.

We created a deep neural network and pre-trained it on two publicly available datasets: the IAM Handwriting Database and the Bentham Collection at University College London. This pre-trained model served as a baseline that we could train further on new data. Using data collected from the crowdsourcing project "Anti-Slavery Manuscripts at the Boston Public Library," we re-trained the model in a pseudo-online fashion: we took existing data but supplied it to the model in small batches, in the same order in which it was collected.
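
The pseudo-online ordering described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Line` record, its field names, and the `chronological_batches` helper are hypothetical, and the abstract does not state the batch size that was used.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Line:
    """One transcribed line of text. All field names are hypothetical."""
    collected_at: float  # when the transcription was collected (e.g. epoch seconds)
    image_id: str        # identifier of the line image
    text: str            # transcription of that line


def chronological_batches(lines: List[Line], batch_size: int) -> Iterator[List[Line]]:
    """Yield small batches of lines in the order in which they were collected."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort by collection time so the model sees the data as it would have
    # arrived while the crowdsourcing project was running.
    ordered = sorted(lines, key=lambda ln: ln.collected_at)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

Sorting by collection time rather than shuffling is what distinguishes this replay from ordinary offline training, where batches are usually drawn at random.
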
To test the model's predictive accuracy, we predicted each new line of text in a batch before training the model on that batch. After training on 90,000 lines of text, the model had an error rate of 12% on previously unseen data. This is slightly higher than the error rates reported in other studies (Sanchez et al., 2014; Sanchez et al., 2015; Sanchez et al., 2016), which generally worked with cleaner, more curated data; that may explain the difference. It also exceeds the 2.5% error rate that volunteers achieve when compared with experts. Nonetheless, in many cases the model's output was identical to the human transcription, which can be used to improve transcription speed, if not accuracy.

We plan to incorporate this model into the human transcription process by showing the predicted transcriptions to volunteers as they transcribe. Much of the infrastructure already exists within Zooniverse thanks to the work on collaborative transcription done within the Anti-Slavery Manuscripts project. Showing volunteers the machine prediction offers many opportunities to improve efficiency. If the computer prediction is correct, the volunteer can agree with it without retyping the whole line. If the volunteer does not agree, they can either correct it or completely redo the transcription, ensuring high accuracy. This process will also improve model performance by allowing us to focus model training on more difficult text.
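
The evaluation described above, in which each batch is predicted before the model is trained on it, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which error metric was used, so a character error rate based on edit distance is assumed here, and `prequential_cer`, the `predict`/`train` interface, and the toy `MemorizingModel` are all hypothetical names.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (line image identifier, reference transcription); a hypothetical data layout.
Example = Tuple[str, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def prequential_cer(model, batches: Iterable[Sequence[Example]]) -> List[float]:
    """Score each batch before training on it; return one error rate per batch.

    `model` is any object with predict(image_id) -> str and train(batch).
    The character error rate is an assumed metric, not one named in the text.
    """
    rates = []
    for batch in batches:
        errors = 0
        chars = 0
        for image_id, reference in batch:
            errors += edit_distance(model.predict(image_id), reference)
            chars += len(reference)
        rates.append(errors / chars if chars else 0.0)
        model.train(batch)  # train only after the batch has been scored
    return rates


class MemorizingModel:
    """Toy stand-in for a recognizer: it recalls lines it has already seen."""

    def __init__(self) -> None:
        self.seen: Dict[str, str] = {}

    def predict(self, image_id: str) -> str:
        return self.seen.get(image_id, "")

    def train(self, batch: Sequence[Example]) -> None:
        self.seen.update(batch)
```

Because every line is scored before the model has trained on it, each per-batch rate is measured on unseen data without setting aside a separate test set.
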

Bibliographic details

  • Conference location: Brussels (BE)
  • Author affiliations

    University of Minnesota - Tate Laboratory, 116 Church St SE, Minneapolis, MN 55455;

    University of Minnesota - Tate Laboratory, 116 Church St SE, Minneapolis, MN 55455;

  • Original format: PDF
  • Text language: English
