Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Abstract

Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost of a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
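
The rule the abstract describes (no redundant labels for an instance whose single label is the common class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `label_with_early_stop`, the `request_label` callback, the class names, and the fallback to a 5-vote majority when the first label is not the common class are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Tuple


def label_with_early_stop(
    request_label: Callable[[], str],  # hypothetical: buys one crowd judgment
    common_class: str = "common",      # hypothetical name of the majority class
    max_votes: int = 5,                # assumed redundancy for the fallback vote
) -> Tuple[str, int]:
    """Return (final_label, labels_purchased) for a single instance."""
    first = request_label()
    if first == common_class:
        # A single common-class label is accepted as-is; no redundancy is bought.
        return first, 1

    # Otherwise (assumption) collect redundant labels and take a majority vote.
    votes = [first]
    while len(votes) < max_votes:
        votes.append(request_label())
    final = Counter(votes).most_common(1)[0][0]
    return final, len(votes)


# Usage: an instance labeled "common" on the first judgment costs one label.
stream = iter(["common"])
print(label_with_early_stop(lambda: next(stream)))  # ('common', 1)

# An instance first labeled "rare" receives the full redundant vote.
stream = iter(["rare", "rare", "common", "rare", "common"])
print(label_with_early_stop(lambda: next(stream)))  # ('rare', 5)
```

In a heavily imbalanced dataset most instances stop after one label under this rule, which is where the saving relative to buying five labels for every instance would come from; the actual savings depend on the class ratio and worker accuracy, which this sketch does not model.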

Bibliographic details

  • Source
    《》 | 2015 | pp. 244-253 | 10 pages
  • Original format: PDF
