Conference: IAPR International Workshop on Document Analysis Systems

The Notary in the Haystack - Countering Class Imbalance in Document Processing with CNNs

Abstract

Notarial instruments are a category of documents. A notarial instrument can be distinguished from other documents by its notary sign, a prominent symbol in the certificate that also identifies the document's issuer. Naturally, notarial instruments are underrepresented relative to other documents. This makes classification difficult, because class imbalance in training data degrades the performance of Convolutional Neural Networks. In this work, we evaluate different countermeasures for this problem. They are applied to a binary classification task and a segmentation task on a collection of medieval documents. In classification, notarial instruments are distinguished from other documents, while in segmentation the notary sign is separated from the certificate. We evaluate different techniques, such as data augmentation, under- and oversampling, as well as regularizing with focal loss. The combination of random minority oversampling and data augmentation leads to the best performance. In segmentation, we evaluate three loss functions and their combinations, where only class-weighted dice loss was able to segment the notary sign sufficiently.
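The abstract names three countermeasures without giving formulas or settings: random minority oversampling, focal loss, and class-weighted dice loss. The NumPy sketch below illustrates the standard textbook form of each; it is not the authors' implementation. The function names, the default gamma = 2.0, the binary-label assumption in the oversampler, and the class weights passed in are all illustrative assumptions, not values from the paper.

```python
import numpy as np


def random_minority_oversample(X, y, rng):
    """Duplicate randomly drawn minority samples until both classes match in size.

    Assumes binary labels; X is (N, ...) and y is (N,).
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = int(counts.max() - counts.min())
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement, since the deficit usually exceeds the minority size.
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


def binary_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Mean focal loss for predicted positive-class probabilities p and labels y.

    gamma is an assumed default; gamma = 0 reduces to plain cross-entropy.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


def class_weighted_dice_loss(probs, onehot, weights, eps=1e-7):
    """One minus the weighted mean of per-class soft dice scores.

    probs and onehot are (N, C) arrays over N pixels; weights is (C,).
    The weights are hypothetical; a larger weight stresses a rare class.
    """
    intersection = (probs * onehot).sum(axis=0)
    denominator = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - (weights * dice).sum() / weights.sum())
```

With hypothetical labels of eight negatives and two positives, `random_minority_oversample` returns sixteen samples, eight per class. In practice the oversampler would be applied to the training split only, so that duplicated minority samples do not leak into evaluation data.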