Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages

Abstract

Dependency parsing is an important natural language processing (NLP) task with many downstream applications, and as is common in the field, high-accuracy results can be obtained by training statistical methods on high-quality annotated data. When dealing with low-resource languages, where annotated training data is unavailable or prohibitively expensive to obtain, more clever methods must be used to leverage existing resources. My work in this thesis focuses on instance selection, which rests on an assumption little explored cross-linguistically but well proven monolingually in domain adaptation: that a smaller amount of training data that is more relevant to the test case is better than a full pool of potentially highly irrelevant training data. I conduct a larger and more thorough exploration than has previously been attempted of instance selection based on the perplexity of part-of-speech tag sequences, using the Google Universal Dependency Treebank, which spans ten languages. Additionally, I apply another instance selection technique based on cross-entropy difference, which has shown superior results to perplexity selection when used for domain adaptation. Both methods are applied to two different pools of training data: one the combination of multiple source languages, the other English alone. Lastly, I explore automatic rearrangement of the part-of-speech tags in the English training data to better match three potential target languages. These experiments show mixed results, which may help to inform future exploration in dependency parsing for low-resource languages. When a pool of multiple source languages is used, a significant boost is seen for target languages whose relevant training data is present but infrequent in the pool, with cross-entropy difference providing slightly better performance than perplexity selection. However, these methods do not provide the same large improvements for target languages with abundant relevant training data among the multiple source languages, or when English alone is used as the training data. Rearranging the part-of-speech tags has a small positive impact on scores when the entire training dataset is used, which is promising for more extensive rearrangement. However, applying instance selection to this rearranged data does not yield better results than selecting from the non-rearranged data.
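The abstract names two selection criteria, perplexity selection and cross-entropy difference (the criterion introduced by Moore and Lewis, 2010, for domain adaptation), without giving implementation detail. The following Python sketch illustrates how both scores could be computed over part-of-speech tag sequences; the bigram language model, add-k smoothing, and all function names are illustrative assumptions, not the thesis's actual code.

```python
# A minimal sketch of the two instance-selection scores, applied to
# POS tag sequences. Illustrative only; not the thesis's implementation.
import math
from collections import defaultdict

def train_bigram_lm(sequences, smoothing=0.1):
    """Train an add-k smoothed bigram LM over POS tag sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in sequences:
        tags = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        total = sum(counts[prev].values())
        return math.log((counts[prev][cur] + smoothing)
                        / (total + smoothing * vocab_size))
    return logprob

def cross_entropy(seq, logprob):
    """Per-tag cross-entropy of one POS sequence under a bigram LM."""
    tags = ["<s>"] + list(seq) + ["</s>"]
    return -sum(logprob(p, c) for p, c in zip(tags, tags[1:])) / (len(tags) - 1)

def select_instances(pool, target_lm, pool_lm=None, k=1000):
    """Rank candidate training sentences (as POS tag sequences).

    Perplexity selection: score by cross-entropy under the target-side
    LM alone. Cross-entropy difference: score by H_target - H_pool,
    preferring sentences that look like the target language but unlike
    the general pool. Lower scores are better in both cases.
    """
    def score(seq):
        h_target = cross_entropy(seq, target_lm)
        if pool_lm is None:
            return h_target                                # perplexity selection
        return h_target - cross_entropy(seq, pool_lm)      # cross-entropy difference
    return sorted(pool, key=score)[:k]

# Hypothetical usage: rank pooled source-language sentences by their
# similarity to a small sample of target-language POS sequences.
# target_lm = train_bigram_lm(target_pos_sequences)
# pool_lm = train_bigram_lm(pool_pos_sequences)
# selected = select_instances(pool_pos_sequences, target_lm, pool_lm, k=5000)
```

Ranking by cross-entropy under the target-side model alone is equivalent to ranking by perplexity, since perplexity is monotone in cross-entropy; the difference score additionally penalizes sentences that are just as likely under the general pool, selecting for target-specific rather than merely fluent sequences.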

Bibliographic Details

  • Author: Jaja, Claire
  • Affiliation: University of Washington
  • Degree-Granting Institution: University of Washington
  • Subjects: Computer science; Linguistics
  • Degree: Masters
  • Year: 2014
  • Pages: 127 p.
  • Total pages: 127
  • Format: PDF
  • Language: English

