Rapid Training of Information Extraction with Local and Global Data Views.

Abstract

This dissertation focuses on fast system development for Information Extraction (IE). State-of-the-art systems heavily rely on extensively annotated corpora, which are slow to build for a new domain or task. Moreover, previous systems are mostly built with local evidence such as words in a short context window or features that are extracted at the sentence level. They usually generalize poorly on new domains.

This dissertation presents novel approaches for rapidly training an IE system for a new domain or task based on both local and global evidence. Specifically, we present three systems: a relation type extension system based on active learning, a relation type extension system based on semi-supervised learning, and a cross-domain bootstrapping system for domain adaptive named entity extraction.

The active learning procedure adopts features extracted at the sentence level as the local view and distributional similarities between relational phrases as the global view. It builds two classifiers based on these two views to find the most informative contention data points to request human labels so as to reduce annotation cost.

The semi-supervised system aims to learn a large set of accurate patterns for extracting relations between names from only a few seed patterns. It estimates the confidence of a name pair both locally and globally: locally by looking at the patterns that connect the pair in isolation; globally by incorporating the evidence from the clusters of patterns that connect the pair. The use of pattern clusters can prevent semantic drift and contribute to a natural stopping criterion for semi-supervised relation pattern discovery.

For adapting a named entity recognition system to a new domain, we propose a cross-domain bootstrapping algorithm, which iteratively learns a model for the new domain with labeled data from the original domain and unlabeled data from the new domain. We first use word clusters as global evidence to generalize features that are extracted from a local context window. We then select self-learned instances as additional training examples using multiple criteria, including some based on global evidence.
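
The contention-point idea from the active learning procedure in the abstract can be illustrated with a minimal sketch. It assumes two already-trained view classifiers that each return a label and a confidence; the function name `select_contention_points`, the toy labels, and the ranking rule (prefer disagreements where both views are confident) are hypothetical choices for illustration, not details taken from the dissertation.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# A view classifier maps an instance to (label, confidence in [0, 1]).
Prediction = Tuple[Hashable, float]
Classifier = Callable[[str], Prediction]


def select_contention_points(
    unlabeled: Sequence[str],
    local_clf: Classifier,
    global_clf: Classifier,
    batch_size: int,
) -> List[str]:
    """Return up to batch_size instances on which the two views disagree.

    Assumption (not from the source): contention points are ranked by the
    smaller of the two confidences, highest first, so that disagreements
    between two confident views are sent to the annotator first.
    """
    scored = []
    for x in unlabeled:
        local_label, local_conf = local_clf(x)
        global_label, global_conf = global_clf(x)
        if local_label != global_label:  # the two views contend
            scored.append((min(local_conf, global_conf), x))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:batch_size]]


# Toy stand-ins for the two trained classifiers (hypothetical data).
pool = ["s1", "s2", "s3", "s4"]
local_view = {
    "s1": ("EMPLOYMENT", 0.9),
    "s2": ("NONE", 0.6),
    "s3": ("NONE", 0.8),
    "s4": ("EMPLOYMENT", 0.7),
}
global_view = {
    "s1": ("EMPLOYMENT", 0.8),
    "s2": ("EMPLOYMENT", 0.7),
    "s3": ("NONE", 0.9),
    "s4": ("NONE", 0.9),
}

to_annotate = select_contention_points(
    pool, local_view.__getitem__, global_view.__getitem__, batch_size=2
)
print(to_annotate)  # ['s4', 's2']
```

Instances on which both views agree (`s1`, `s3`) are never sent for annotation, which is how a two-view setup of this kind can reduce labeling cost.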

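Bibliographic record

  • Author: Sun, Ang
  • Author affiliation: New York University
  • Degree-granting institution: New York University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 112 p.
  • Total pages: 112
  • Original format: PDF
  • Language: English (eng)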