Conference: International Conference on Computational Linguistics

Active Learning for Chinese Word Segmentation

Abstract

Currently, the best-performing models for Chinese word segmentation (CWS) require very large amounts of annotated data. One promising way to reduce the cost of data acquisition is active learning, which selects the most useful instances to annotate for learning. Active learning on CWS, however, remains challenging due to its inherent nature. In this paper, we propose a Word Boundary Annotation (WBA) model to make effective active learning on CWS possible. This is achieved by annotating only the uncertain boundaries, which greatly reduces the manual annotation cost compared to annotating the whole character sequence. To further reduce the annotation effort, a diversity measurement among the instances is considered to avoid duplicate annotation. Experimental results show that incorporating the WBA model and the diversity measurement into active learning on CWS saves much annotation cost with little loss in performance.
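The idea in the abstract of annotating only uncertain boundaries while skipping near-duplicate instances can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's actual WBA model: the boundary probabilities are supposed to come from some existing segmenter, the uncertainty score `1 - |2p - 1|` is a common generic choice, and the diversity check is a simple stand-in that skips gaps whose surrounding characters have already been selected. The names `uncertainty`, `context`, and `select_boundaries` are hypothetical.

```python
from typing import List, Tuple


def uncertainty(p: float) -> float:
    """Score in [0, 1]: 1.0 when p == 0.5, 0.0 when p is 0 or 1 (assumed measure)."""
    return 1.0 - abs(2.0 * p - 1.0)


def context(sentence: str, gap: int, width: int = 1) -> str:
    """Characters surrounding the gap between sentence[gap] and sentence[gap + 1]."""
    return sentence[max(0, gap - width + 1): gap + 1 + width]


def select_boundaries(
    sentences: List[str],
    probs: List[List[float]],
    budget: int,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Pick up to `budget` (sentence index, gap index) pairs to send to an annotator.

    probs[i][g] is the (assumed) model probability that a word boundary
    follows character g of sentences[i].
    """
    if budget <= 0:
        return []

    # Keep only gaps whose uncertainty reaches the threshold.
    candidates = []
    for s_idx, ps in enumerate(probs):
        for gap, p in enumerate(ps):
            u = uncertainty(p)
            if u >= threshold:
                candidates.append((u, s_idx, gap))

    # Most uncertain first; ties broken by position for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    # Simple diversity filter: skip a gap if the same character context
    # has already been selected, to avoid duplicate annotation.
    seen = set()
    selected = []
    for _, s_idx, gap in candidates:
        ctx = context(sentences[s_idx], gap)
        if ctx in seen:
            continue
        seen.add(ctx)
        selected.append((s_idx, gap))
        if len(selected) == budget:
            break
    return selected


sentences = ["我爱北京", "他爱北京"]
probs = [[0.9, 0.5, 0.1], [0.95, 0.55, 0.05]]
picked = select_boundaries(sentences, probs, budget=2)
```

In this toy input both sentences have an uncertain gap between "爱" and "北", so only the first one is selected and the second is skipped as a duplicate context. A real system would replace the exact-match context check with whatever similarity measure suits the data.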
