International Conference on Language Resources and Evaluation

The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

Abstract

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts from the 10 STEM disciplines found to be the most prolific on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight our findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses that Babelfy returned for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are reasonable in a setting as wide-ranging as STEM.
