Database: The Journal of Biological Databases and Curation

Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems

Abstract

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high-quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity–quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine–human consistency, or similarity, was significantly lower than inter-curator (human–human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators, and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.