Improving Lemmatization of Non-Standard Languages with Joint Learning

Abstract

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture, which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state of the art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than both a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on the processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.
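
To make the string-transduction framing concrete, the following is a minimal sketch, not the authors' implementation. It shows how a lemma-annotated sentence could be turned into character-level source/target pairs that also carry the full sentence as context, with no POS or morphological tags involved. The names `Example`, `make_examples`, and `joint_loss`, the boundary symbols, the toy sentence, and the weighted-sum form of the joint objective (including `lm_weight`) are all assumptions made for illustration; the abstract only states that the sentence encoder is trained jointly for lemmatization and language modeling.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical boundary symbols for the decoder's target sequence.
BOS, EOS = "<bos>", "<eos>"


class Example(NamedTuple):
    source_chars: List[str]  # characters of the token to lemmatize
    target_chars: List[str]  # characters of the lemma, wrapped in BOS/EOS
    context: List[str]       # the whole sentence, for a sentence-level encoder
    position: int            # index of the token within the sentence


def make_examples(sentence: List[Tuple[str, str]]) -> List[Example]:
    """Turn (token, lemma) pairs into one transduction example per token."""
    tokens = [tok for tok, _ in sentence]
    return [
        Example(list(tok), [BOS, *lemma, EOS], tokens, i)
        for i, (tok, lemma) in enumerate(sentence)
    ]


def joint_loss(lemma_loss: float, lm_loss: float, lm_weight: float = 0.5) -> float:
    """Assumed form of a joint objective: lemmatization loss plus a
    weighted language-modeling loss. The weighting scheme is illustrative."""
    return lemma_loss + lm_weight * lm_loss


# Made-up toy sentence with non-standard spellings (not from the released dataset).
toy = [("vnto", "unto"), ("the", "the"), ("kynges", "king")]
for ex in make_examples(toy):
    print(ex.position, "".join(ex.source_chars), "->", ex.target_chars)
```

In a full system, `source_chars` would feed a character-level encoder, `context` a sentence-level encoder, and the decoder would emit `target_chars` one character at a time; those components are omitted here.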
