The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources



Abstract

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy-returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, in a setting as wide-ranging as STEM is reasonable.


Bibliographic record

Source
International Conference on Language Resources and Evaluation, 2020, pp. 2192-2203 (12 pages)
Authors
Jennifer D'Souza; Anett Hoppe; Arthur Brack; Mohamad Yaser Jaradeh; Sören Auer; Ralph Ewerth
Original format: PDF
Keywords
Entity Recognition; Entity Classification; Entity Resolution; Entity Linking; Word Sense Disambiguation; Evaluation Corpus; Language Resource;


