
Multilingual distributional lexical similarity.



Abstract

One of the most fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is virtually unlimited (proper names, technical jargon, foreign borrowings, newly created words, etc.), meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and maintaining broad-coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lexical information of the kind that can be obtained relatively easily from unlabeled or sparsely labeled text corpora. Rule-based approaches to acquiring or augmenting repositories of lexical information typically offer a high-precision, low-recall methodology that fails to generalize to new domains or scale to very large datasets. Classification-based approaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the necessary scale.

This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. The viability of this approach is demonstrated on the task of automatically identifying English loanwords in Korean: a set of rules describing the changes English words undergo when they are borrowed into Korean is used to generate training data for an etymological classification task.
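The rule-to-classifier pipeline described here can be sketched roughly as follows. The word list, labels, bigram features, and Naive Bayes model below are illustrative assumptions for exposition, not the dissertation's actual rules or implementation; the point is only that noisy rule-assigned labels can train a classifier that generalizes to unseen words:

```python
from collections import Counter, defaultdict
import math

# Hypothetical rule-labeled data: imagine a rule-based transliteration
# detector has tagged romanized Korean words as English loanwords ("loan")
# or native Korean ("native"). Labels are noisy but cheap to produce at scale.
rule_labeled = [
    ("keompyuteo", "loan"),   # 컴퓨터 'computer'
    ("keopi", "loan"),        # 커피 'coffee'
    ("peurinteo", "loan"),    # 프린터 'printer'
    ("saram", "native"),      # 사람 'person'
    ("hangang", "native"),    # 한강 'Han River'
    ("bulgogi", "native"),    # 불고기 'bulgogi'
]

def bigrams(word):
    """Character bigrams with word-boundary markers."""
    w = f"^{word}$"
    return [w[i:i + 2] for i in range(len(w) - 1)]

class NaiveBayes:
    """Multinomial Naive Bayes over character bigrams, Laplace-smoothed."""

    def fit(self, data):
        self.counts = defaultdict(Counter)   # per-class bigram counts
        self.class_totals = Counter()        # per-class total bigram count
        self.vocab = set()
        for word, label in data:
            for bg in bigrams(word):
                self.counts[label][bg] += 1
                self.class_totals[label] += 1
                self.vocab.add(bg)
        self.priors = Counter(label for _, label in data)
        return self

    def predict(self, word):
        n = sum(self.priors.values())
        best, best_lp = None, -math.inf
        for label in self.priors:
            lp = math.log(self.priors[label] / n)
            for bg in bigrams(word):
                num = self.counts[label][bg] + 1
                den = self.class_totals[label] + len(self.vocab)
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(rule_labeled)
print(clf.predict("teomineol"))  # 터미널 'terminal' -> "loan"
```

Even though every training label comes from imperfect rules, the classifier can tag words the rules never saw, which is the leverage the abstract describes: scale compensates for label noise.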
Although the quality of the rule-based output is low, at sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance previously obtained only with access to substantial hand-labeled training data.

The second approach to obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. This work is geared toward automatically extending the coverage of verb classification schemes such as Levin's classes, VerbNet, and FrameNet to other verbs that occur in a large text corpus.
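The second approach can be illustrated with a minimal sketch: turn parser output into (verb, syntactic slot, argument head) co-occurrence counts and compare verbs by cosine similarity. The triples and slot names below are invented for illustration (they stand in for what a statistical parser would emit over a large corpus), and cosine over raw counts is just one simple distributional similarity measure, not necessarily the one used in the dissertation:

```python
import math
from collections import defaultdict

# Hypothetical (verb, syntactic-slot, argument-head) triples, as might be
# extracted from a statistical parser's output over a corpus.
triples = [
    ("break",   "dobj",  "window"), ("break",   "dobj", "vase"),
    ("break",   "nsubj", "boy"),
    ("shatter", "dobj",  "window"), ("shatter", "dobj", "glass"),
    ("shatter", "nsubj", "boy"),
    ("eat",     "dobj",  "apple"),  ("eat",     "dobj", "bread"),
    ("eat",     "nsubj", "boy"),
]

def feature_vectors(triples):
    """Map each verb to counts over (slot, head) co-occurrence features."""
    vecs = defaultdict(lambda: defaultdict(int))
    for verb, slot, head in triples:
        vecs[verb][(slot, head)] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

vecs = feature_vectors(triples)
print(cosine(vecs["break"], vecs["shatter"]))  # high: shared objects
print(cosine(vecs["break"], vecs["eat"]))      # lower: only shared subject
```

Verbs that take similar arguments in similar syntactic slots end up with similar vectors, which is the distributional signal that lets a clustering or classification step group verbs into semantic classes without hand labeling.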

Bibliographic record

  • Author: Baker, Kirk.
  • Author affiliation: The Ohio State University.
  • Degree grantor: The Ohio State University.
  • Subject: Language, Linguistics.
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 243 p.
  • Total pages: 243
  • Format: PDF
  • Language: eng
