Venue: Conference on Empirical Methods in Natural Language Processing

Combining Human and Machine Transcriptions on the Zooniverse Platform

Abstract

Transcribing handwritten documents to create fully searchable texts is an essential part of the archival process. Traditional text recognition methods, such as optical character recognition (OCR), do not work well on handwritten documents, which are frequently noisy and lack the individually segmented letters that OCR requires. Crowdsourcing and improved machine models are two modern methods for transcribing handwritten documents. Transcription projects on Zooniverse, a platform for crowdsourced research, generally involve three steps: 1) Volunteers identify lines of text; 2) Volunteers type out the text associated with a marked line; 3) Researchers combine raw transcription data to generate a consensus. This approach works well, but projects generally require 10-15 volunteer transcriptions per document to ensure accuracy and coverage, which is time-consuming. Modern machine models for handwritten text recognition use neural networks to transcribe full lines of unsegmented text. These models achieve high accuracy on standard datasets (Sanchez et al., 2014), but do not generalize well (Messina and Louradour, 2015; Moysset et al., 2014). While modern techniques substantially improve our ability to collect data, humans are limited in speed and computers are limited in accuracy. Therefore, combining human and machine classifiers offers the most efficient transcription system. We created a deep neural network and pre-trained it on two publicly available datasets: the IAM Handwriting Database and the Bentham Collection at University College London. This pre-trained model served as a baseline from which we could further train the model on new data. Using data collected from the crowdsourcing project "Anti-Slavery Manuscripts at the Boston Public Library," we re-trained the model in a pseudo-online fashion. Specifically, we took existing data but supplied it to the model in small batches, in the order in which it was collected.
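The pseudo-online setup described above can be illustrated with a minimal sketch. The abstract does not give the record layout or batch size, so the `Line` record, its `collected_at` field, and the `chronological_batches` helper are hypothetical names introduced only for illustration; this is not the authors' code.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Line:
    """One transcribed line (hypothetical record layout, not from the paper)."""
    collected_at: int  # position in collection order, e.g. a timestamp
    text: str          # transcription text associated with the line


def chronological_batches(lines: Iterable[Line], batch_size: int) -> Iterator[List[Line]]:
    """Yield small batches of lines in the order they were collected."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Existing data is replayed as if it were arriving live: oldest first.
    ordered = sorted(lines, key=lambda line: line.collected_at)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

Each yielded batch would then be passed to whatever training step the model exposes, so the model never sees later data before earlier data.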
To test the model's predictive accuracy, we had the model predict each new line of text in a batch before training on that batch. After training on 90,000 lines of text, the model had an error rate of 12% on previously unseen data. This is slightly higher than the error rates reported in other studies (Sanchez et al., 2014; Sanchez et al., 2015; Sanchez et al., 2016), which generally worked with cleaner, more curated data; that may explain the difference. This error rate also exceeds the 2.5% error rate achieved by volunteers when compared to experts. Nonetheless, the model's output matched the human transcription in many cases, which can be used to improve transcription speed, if not accuracy. We plan to incorporate this model into the human transcription process by showing the predicted transcriptions to volunteers as they transcribe. Much of the infrastructure already exists within Zooniverse due to the work on collaborative transcription done within the Anti-Slavery Manuscripts project. Showing volunteers the machine prediction creates many opportunities to improve efficiency. If the computer prediction is correct, the volunteer can agree with it without retyping the whole line. If the volunteer does not agree, they can either correct it or completely redo the transcription, ensuring high accuracy. This process will also improve model performance by allowing us to focus model training on more difficult text.
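The predict-before-train evaluation mentioned at the start of this paragraph can be sketched as follows. The abstract does not say how the error rate was computed; this sketch assumes a character error rate based on Levenshtein edit distance. The `predict` and `train` callables, the `(line_id, reference_text)` batch layout, and the function names are hypothetical stand-ins for the real model and data.

```python
from typing import Callable, Iterable, List, Tuple

Batch = List[Tuple[str, str]]  # (line_id, reference_text) pairs; hypothetical layout


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


def prequential_error_rates(
    batches: Iterable[Batch],
    predict: Callable[[str], str],
    train: Callable[[Batch], None],
) -> List[float]:
    """Score each batch before training on it; return one error rate per batch."""
    rates = []
    for batch in batches:
        errors = sum(edit_distance(predict(line_id), ref) for line_id, ref in batch)
        chars = sum(len(ref) for _, ref in batch)
        rates.append(errors / chars if chars else 0.0)
        train(batch)  # the model sees the batch only after it has been scored
    return rates
```

Because every line is scored while it is still unseen, the per-batch rates trace how accuracy on new data changes as training proceeds, without holding out a separate test set.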
