
Multilingual distributional lexical similarity.



Abstract

One of the most fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is virtually unlimited (proper names, technical jargon, foreign borrowings, newly created words, etc.), meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and maintaining broad-coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lexical information of the kind that can be obtained relatively easily from unlabeled or sparsely labeled text corpora. Rule-based approaches to acquiring or augmenting repositories of lexical information typically offer a high-precision, low-recall methodology that fails to generalize to new domains or scale to very large datasets. Classification-based approaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the necessary scale.

This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. The viability of this approach is demonstrated on the task of automatically identifying English loanwords in Korean: a set of rules describing the changes English words undergo when they are borrowed into Korean is used to generate training data for an etymological classification task.
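The rule-to-classifier pipeline described here can be sketched roughly as follows. The word list, labels, bigram features, and Naive Bayes model below are illustrative assumptions for exposition, not the dissertation's actual rules or implementation; the point is only that noisy rule-assigned labels can train a classifier that generalizes to unseen words:

```python
from collections import Counter, defaultdict
import math

# Hypothetical rule-labeled data: imagine a rule-based transliteration
# detector has tagged romanized Korean words as English loanwords ("loan")
# or native Korean ("native"). Labels are noisy but cheap to produce at scale.
rule_labeled = [
    ("keompyuteo", "loan"),   # 컴퓨터 'computer'
    ("keopi", "loan"),        # 커피 'coffee'
    ("peurinteo", "loan"),    # 프린터 'printer'
    ("saram", "native"),      # 사람 'person'
    ("hangang", "native"),    # 한강 'Han River'
    ("bulgogi", "native"),    # 불고기 'bulgogi'
]

def bigrams(word):
    """Character bigrams with word-boundary markers."""
    w = f"^{word}$"
    return [w[i:i + 2] for i in range(len(w) - 1)]

class NaiveBayes:
    """Multinomial Naive Bayes over character bigrams, Laplace-smoothed."""

    def fit(self, data):
        self.counts = defaultdict(Counter)   # per-class bigram counts
        self.class_totals = Counter()        # per-class total bigram count
        self.vocab = set()
        for word, label in data:
            for bg in bigrams(word):
                self.counts[label][bg] += 1
                self.class_totals[label] += 1
                self.vocab.add(bg)
        self.priors = Counter(label for _, label in data)
        return self

    def predict(self, word):
        n = sum(self.priors.values())
        best, best_lp = None, -math.inf
        for label in self.priors:
            lp = math.log(self.priors[label] / n)
            for bg in bigrams(word):
                num = self.counts[label][bg] + 1
                den = self.class_totals[label] + len(self.vocab)
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(rule_labeled)
print(clf.predict("teomineol"))  # 터미널 'terminal' -> "loan"
```

Even though every training label comes from imperfect rules, the classifier can tag words the rules never saw, which is the leverage the abstract describes: scale compensates for label noise.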
Although the quality of the rule-based output is low, at sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance previously obtained only with access to substantial hand-labeled training data.

The second approach to obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. This work is geared toward automatically extending the coverage of verb classification schemes such as Levin's classes, VerbNet, and FrameNet to other verbs that occur in a large text corpus.
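The second approach can be illustrated with a minimal sketch: turn parser output into (verb, syntactic slot, argument head) co-occurrence counts and compare verbs by cosine similarity. The triples and slot names below are invented for illustration (they stand in for what a statistical parser would emit over a large corpus), and cosine over raw counts is just one simple distributional similarity measure, not necessarily the one used in the dissertation:

```python
import math
from collections import defaultdict

# Hypothetical (verb, syntactic-slot, argument-head) triples, as might be
# extracted from a statistical parser's output over a corpus.
triples = [
    ("break",   "dobj",  "window"), ("break",   "dobj", "vase"),
    ("break",   "nsubj", "boy"),
    ("shatter", "dobj",  "window"), ("shatter", "dobj", "glass"),
    ("shatter", "nsubj", "boy"),
    ("eat",     "dobj",  "apple"),  ("eat",     "dobj", "bread"),
    ("eat",     "nsubj", "boy"),
]

def feature_vectors(triples):
    """Map each verb to counts over (slot, head) co-occurrence features."""
    vecs = defaultdict(lambda: defaultdict(int))
    for verb, slot, head in triples:
        vecs[verb][(slot, head)] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

vecs = feature_vectors(triples)
print(cosine(vecs["break"], vecs["shatter"]))  # high: shared objects
print(cosine(vecs["break"], vecs["eat"]))      # lower: only shared subject
```

Verbs that take similar arguments in similar syntactic slots end up with similar vectors, which is the distributional signal that lets a clustering or classification step group verbs into semantic classes without hand labeling.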

Bibliographic record

  • Author: Baker, Kirk.
  • Author affiliation: The Ohio State University.
  • Degree grantor: The Ohio State University.
  • Subject: Language, Linguistics.
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 243 p.
  • Total pages: 243
  • Format: PDF
  • Language: eng
