
Cross-document entity co-reference resolution in noisy environments.

Abstract

The cross-document entity co-reference resolution task is part of, and largely based on, many Natural Language Processing (NLP) innovations that have been developed over the past several decades and are still evolving. At the core of the task is determining whether two or more pieces of information, each mentioned locally in various documents, refer to the same global entity. The complexity of the task increases when the documents represent a variety of sources, including unstructured text, speech, and foreign languages.

This dissertation presents a complete end-to-end solution to the problem of cross-document entity co-reference. The research divides the task into two major components: Name Matching and Entity Disambiguation. The two components have been designed for scalability, extensibility, and incremental processing.

The Name Matching component consists of more than a dozen algorithms and produces equivalences between corpus names. The algorithms use a variety of information sources, which fall into four categories: World Knowledge, Web Knowledge, String Similarity, and Statistical Extraction. The produced alternatives represent various name forms: misspellings, aliases, abbreviations, alternative spellings, short and long versions, nicknames, etc.

The Entity Disambiguation component implements a 3-stage agglomerative clustering algorithm to resolve local entities to global clusters. Each of the algorithm stages uses combinations of various features to adjust discriminative levels in cluster comparisons. The features are retrieved from document metadata, information extraction, unsupervised topicality, and more.

The system has been evaluated in several ways. The Name Matching component has been tested on a corpus consisting of more than half a million documents from various genres. The performance of the Entity Disambiguation component has been measured against human-annotated collections of English and Arabic documents mentioning ambiguous person and organization names. The truth data were produced using a cross-document annotation tool designed specifically for this research.*

*Copyright 2008 by BBN Technologies Corp. All Rights Reserved. Distribution Statement A (Approved for Public Release; Distribution Unlimited).
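
The abstract does not give algorithmic detail, so the following is only a minimal sketch of the general idea, not the dissertation's method. It assumes a single clustering stage (the dissertation describes three), a generic string-similarity measure from Python's standard library standing in for the String Similarity category, and a hypothetical context-word overlap feature. All function names, the sample mentions, the weights, and the threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_overlap(a: set, b: set) -> float:
    """Jaccard overlap of context words (hypothetical feature)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mention_score(m1: dict, m2: dict, name_weight: float = 0.5) -> float:
    """Weighted mix of name similarity and context overlap."""
    return (name_weight * name_similarity(m1["name"], m2["name"])
            + (1 - name_weight) * context_overlap(m1["context"], m2["context"]))


def cluster_score(c1: list, c2: list) -> float:
    """Average-linkage score between two clusters of mentions."""
    scores = [mention_score(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(mentions: list, threshold: float = 0.6) -> list:
    """Greedily merge the best-scoring cluster pair until no pair
    reaches the threshold. Each mention starts as its own cluster."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_score(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_score(clusters[i], clusters[j]) < threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Illustrative local mentions from three documents (made-up data).
mentions = [
    {"doc": "d1", "name": "John Smith", "context": {"senator", "ohio", "election"}},
    {"doc": "d2", "name": "Jon Smith", "context": {"senator", "ohio", "vote"}},
    {"doc": "d3", "name": "John Smith", "context": {"guitarist", "album", "tour"}},
]

for cluster in agglomerate(mentions):
    print(sorted(m["doc"] for m in cluster))
```

In this toy setup the misspelled name in d2 can still join d1 because the context words agree, while the identically spelled name in d3 stays separate because its context does not. A multi-stage variant would presumably rerun the merge loop with different feature combinations and thresholds, but the abstract does not specify how.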

Bibliographic record

  • Author

    Baron, Alex

  • Author affiliation

    Brandeis University

  • Degree-granting institution Brandeis University
  • Subject Language Linguistics; Computer Science
  • Degree Ph.D.
  • Year 2008
  • Pages 204 p.
  • Total pages 204
  • Original format PDF
  • Language of text eng
