Conference on Empirical Methods in Natural Language Processing

Tackling the Low-resource Challenge for Canonical Segmentation

Abstract

Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.
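
To make the task and the reported metric concrete, here is a minimal Python sketch. It assumes that "accuracy" means exact match of the full predicted segmentation against the reference, which the abstract does not spell out. The function name `exact_match_accuracy` and the English example words are illustrative and are not taken from the paper's datasets or code.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of words whose predicted segmentation equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Illustrative examples (assumed, not from the paper's data).
# Canonical segmentation restores the standardized form of each morpheme,
# so the output is not necessarily a substring split of the input word.
gold = {
    "funniest": ["funny", "est"],
    "walked": ["walk", "ed"],
    "achievability": ["achieve", "able", "ity"],
}

# Hypothetical model output: the first entry is a surface split
# ("funni" instead of "funny"), so it counts as an error.
predicted = {
    "funniest": ["funni", "est"],
    "walked": ["walk", "ed"],
    "achievability": ["achieve", "able", "ity"],
}

words = list(gold)
accuracy = exact_match_accuracy(
    [predicted[w] for w in words],
    [gold[w] for w in words],
)
print(f"accuracy = {accuracy:.3f}")
```

Under this assumed metric, a prediction that splits the word at the right positions but leaves a morpheme in its surface form still counts as wrong, which is one reason the task is harder than surface segmentation.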
