Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages

Abstract

Dependency parsing is an important natural language processing (NLP) task with many downstream applications, and as is common in the field, high-accuracy results can be obtained by training statistical methods on high-quality annotated data. When dealing with low-resource languages, where annotated training data is unavailable or prohibitively expensive to obtain, more clever methods must be used to leverage existing resources. My work in this thesis focuses on instance selection, which rests on an assumption little explored cross-linguistically but well proven monolingually in domain adaptation: that a smaller amount of training data that is more relevant to the test case is better than a full pool of potentially highly irrelevant training data. I conduct a larger and more thorough exploration than has previously been attempted of instance selection based on the perplexity of part-of-speech tag sequences, using the Google Universal Dependency Treebank, which spans ten languages. Additionally, I apply another instance selection technique based on cross-entropy difference, which has shown superior results to perplexity selection when used for domain adaptation. Both methods are applied to two different pools of training data: one the combination of multiple source languages, the other English alone. Lastly, I explore automatic rearrangement of the part-of-speech tags in the English training data to better match three potential target languages. These experiments show mixed results, which may help to inform future exploration in dependency parsing for low-resource languages. When a pool of multiple source languages is used, a significant boost is seen for target languages whose relevant training data is present but infrequent in the pool, with cross-entropy difference providing slightly better performance than perplexity selection. However, these methods do not provide the same large improvements for target languages with abundant relevant training data among the multiple source languages, or when English alone is used as the training data. Rearranging the part-of-speech tags has a small positive impact on scores when the entire training dataset is used, which is promising for more extensive rearrangement. However, applying instance selection to this rearranged data does not yield better results than selecting from the non-rearranged data.
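The abstract names two selection criteria, perplexity selection and cross-entropy difference (the criterion introduced by Moore and Lewis, 2010, for domain adaptation), without giving implementation detail. The following Python sketch illustrates how both scores could be computed over part-of-speech tag sequences; the bigram language model, add-k smoothing, and all function names are illustrative assumptions, not the thesis's actual code.

```python
# A minimal sketch of the two instance-selection scores, applied to
# POS tag sequences. Illustrative only; not the thesis's implementation.
import math
from collections import defaultdict

def train_bigram_lm(sequences, smoothing=0.1):
    """Train an add-k smoothed bigram LM over POS tag sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in sequences:
        tags = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        total = sum(counts[prev].values())
        return math.log((counts[prev][cur] + smoothing)
                        / (total + smoothing * vocab_size))
    return logprob

def cross_entropy(seq, logprob):
    """Per-tag cross-entropy of one POS sequence under a bigram LM."""
    tags = ["<s>"] + list(seq) + ["</s>"]
    return -sum(logprob(p, c) for p, c in zip(tags, tags[1:])) / (len(tags) - 1)

def select_instances(pool, target_lm, pool_lm=None, k=1000):
    """Rank candidate training sentences (as POS tag sequences).

    Perplexity selection: score by cross-entropy under the target-side
    LM alone. Cross-entropy difference: score by H_target - H_pool,
    preferring sentences that look like the target language but unlike
    the general pool. Lower scores are better in both cases.
    """
    def score(seq):
        h_target = cross_entropy(seq, target_lm)
        if pool_lm is None:
            return h_target                                # perplexity selection
        return h_target - cross_entropy(seq, pool_lm)      # cross-entropy difference
    return sorted(pool, key=score)[:k]

# Hypothetical usage: rank pooled source-language sentences by their
# similarity to a small sample of target-language POS sequences.
# target_lm = train_bigram_lm(target_pos_sequences)
# pool_lm = train_bigram_lm(pool_pos_sequences)
# selected = select_instances(pool_pos_sequences, target_lm, pool_lm, k=5000)
```

Ranking by cross-entropy under the target-side model alone is equivalent to ranking by perplexity, since perplexity is monotone in cross-entropy; the difference score additionally penalizes sentences that are just as likely under the general pool, selecting for target-specific rather than merely fluent sequences.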

Bibliographic Details

  • Author: Jaja, Claire
  • Affiliation: University of Washington
  • Degree-Granting Institution: University of Washington
  • Subjects: Computer science; Linguistics
  • Degree: Masters
  • Year: 2014
  • Pages: 127 p.
  • Total pages: 127
  • Format: PDF
  • Language: English

