Journal: Semantic Web

Distantly supervised Web relation extraction for knowledge base population


Abstract

Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, one of those approaches being distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases we present and evaluate several information integration strategies and show that those benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
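The labelling step that distant supervision relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base, the example sentences, the `Triple` type and the `label_sentences` helper are all assumptions made for illustration, and plain substring matching stands in for the entity recognition the abstract mentions.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One (subject, relation, object) fact from a knowledge base."""
    subject: str
    relation: str
    obj: str


# Hypothetical toy knowledge base; a real system would query Linked Open Data.
KB = [
    Triple("The Beatles", "album", "Let It Be"),
    Triple("The Beatles", "origin", "Liverpool"),
]

# Illustrative sentences standing in for text extracted from Web pages.
sentences = [
    "The Beatles released Let It Be in 1970.",
    "The Beatles were formed in Liverpool.",
    "The Beatles recorded the song Let It Be in 1969.",  # song, not album
    "Liverpool is a city in England.",
]


def label_sentences(
    sents: List[str], kb: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a relation when both the subject and the
    object of a KB triple occur in it (naive substring matching)."""
    labelled = []
    for sentence in sents:
        lowered = sentence.lower()
        for t in kb:
            if t.subject.lower() in lowered and t.obj.lower() in lowered:
                labelled.append((sentence, t.subject, t.relation, t.obj))
    return labelled


labelled = label_sentences(sentences, KB)
for sentence, subj, rel, obj in labelled:
    print(f"{rel}({subj}, {obj}) <- {sentence}")
```

The third sentence is labelled `album` even though it refers to the song of the same name, which is the kind of noise from lexical ambiguity that the abstract describes. The fourth sentence receives no label because it mentions only one entity of any triple, a small-scale instance of the sparsity problem.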
