Journal of Chinese Information Processing (《中文信息学报》)

Constructing a Chinese Relation Extraction Dataset with Weakly Supervised and Semi-Automatic Methods

Abstract

Relation extraction is a fundamental task in information extraction, with practical significance for information retrieval, question answering, and knowledge graphs. Existing relation extraction datasets cover too few relation categories, are difficult to annotate at the sentence level, and are hard to extend; moreover, they are available only for English, so they cannot adequately support Chinese relation extraction. This paper constructs a Chinese relation extraction dataset using a weakly supervised, semi-automatic method that addresses these shortcomings. First, a large number of relation pairs are extracted from Wikipedia. Sentences containing the corresponding entity pairs are then extracted from Baidu search results and the Sogou News corpus, which completes the weakly supervised sentence extraction step. These sentences are scored by an RNN-based relation extraction system, and the sentences with high annotation value are submitted for manual annotation. The annotation results are then processed to yield the final Chinese relation extraction dataset.
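
The weakly supervised sentence extraction step in the abstract can be illustrated with a minimal sketch. It assumes plain substring matching of both entities, which the abstract does not specify; the names `Triple`, `Candidate`, and `weakly_label`, the relation label `capital_of`, and the sample sentences are all hypothetical and are not taken from the paper. The RNN scoring and manual annotation steps are not shown.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    """A relation pair, e.g. as harvested from a knowledge source (hypothetical shape)."""
    head: str
    relation: str
    tail: str


class Candidate(NamedTuple):
    """A sentence with a weak (unverified) relation label."""
    sentence: str
    head: str
    tail: str
    relation: str


def weakly_label(triples: Iterable[Triple], sentences: Iterable[str]) -> List[Candidate]:
    """Keep every sentence that mentions both entities of some triple.

    Assumption: an entity counts as mentioned if its surface string
    occurs in the sentence. The label is only a guess until a human
    annotator confirms it.
    """
    triples = list(triples)
    candidates = []
    for sentence in sentences:
        for t in triples:
            if t.head in sentence and t.tail in sentence:
                candidates.append(Candidate(sentence, t.head, t.tail, t.relation))
    return candidates


# Toy data for illustration only.
triples = [Triple("北京", "capital_of", "中国")]
sentences = ["北京是中国的首都。", "上海是一座大城市。"]
candidates = weakly_label(triples, sentences)
```

Only the first toy sentence mentions both entities, so it alone becomes a candidate. A sentence can mention both entities without expressing the relation, which is why the abstract follows this step with scoring and manual annotation.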
