Entity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction.

Abstract

With the advent of the Web, textual information has grown at an explosive rate. To digest this enormous amount of data, an automatic solution, Information Extraction (IE), has become necessary. Information extraction is the task of converting unstructured text strings into structured, machine-readable data. The first key step of a general IE pipeline is often to analyze the entities mentioned in the text before drawing holistic conclusions. To fully understand each entity, one needs to detect its mentions, categorize them into semantic types, connect them with their knowledge base entries, and identify the entity's attributes as well as its relationships with other entities.

In this dissertation, we first present the problem of fine-grained entity recognition. Unlike most traditional named entity recognition systems, which use a small set of entity classes (e.g., person, organization, location, or miscellaneous), we define a novel set of over one hundred fine-grained entity types. To intelligently understand text and extract a wide range of information, it is useful to determine more precisely the semantic classes of entities mentioned in unstructured text. We formulate the recognition problem as multi-class, multi-label classification, describe an unsupervised method for collecting training data, and present the FIGER implementation.

Next, we demonstrate that fine-grained entity types are closely connected with other entity analysis tasks. We describe an entity linking system whose predictions rely heavily on these types and present a simple yet effective implementation, called VINCULUM. An extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the-art systems, elucidates key aspects of the system, including mention extraction, candidate generation, entity type prediction, entity coreference, and coherence.

Finally, we describe an approach to acquiring commonsense knowledge from a massive amount of text on the Web. In particular, a system called SIZEITALL is developed to extract numerical attribute values for various classes of entities. To resolve the ambiguity of surface-form text, we canonicalize the extractions with respect to WordNet senses and build a knowledge base of physical sizes for thousands of entity classes.

Throughout all three entity analysis tasks, we show the feasibility of building sophisticated IE systems without a significant investment of human effort in creating sufficient labeled data.

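Bibliographic record

  • Author

    Ling, Xiao

  • Author affiliation

    University of Washington

  • Degree-granting institution: University of Washington
  • Subject: Computer science; Artificial intelligence
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 100 p.
  • Total pages: 100
  • Original format: PDF
  • Text language: English (eng)
  • Chinese Library Classification
  • Keywords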
