Journal of Cheminformatics

A document processing pipeline for annotating chemical entities in scientific documents



Abstract

Background: The recognition of drugs and chemical entities in text is an important task within biomedical information extraction, given the rapid growth in the amount of published text (scientific papers, patents, patient records) and the relevance of these and other related concepts. Done effectively, it would allow such textual resources to be exploited to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.

Results: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine learning-based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task.

Conclusions: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering of erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
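The post-processing stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption: the function names `is_balanced` and `post_process` are hypothetical, the parentheses step simply drops mentions with unbalanced parentheses (a simplification of whatever correction the authors apply), and the exclusion list is matched case-insensitively. Abbreviation resolution is not covered.

```python
from typing import Iterable, List


def is_balanced(mention: str) -> bool:
    """Return True if every '(' in the mention has a matching ')'."""
    depth = 0
    for ch in mention:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def post_process(mentions: Iterable[str], exclusion_list: Iterable[str]) -> List[str]:
    """Filter candidate mentions (hypothetical simplification of the paper's stage).

    1. Drop mentions whose parentheses are unbalanced.
    2. Drop mentions found in the exclusion list (case-insensitive match).
    """
    excluded = {term.lower() for term in exclusion_list}
    kept = []
    for mention in mentions:
        if not is_balanced(mention):
            continue
        if mention.lower() in excluded:
            continue
        kept.append(mention)
    return kept


if __name__ == "__main__":
    candidates = ["aspirin", "poly(ethylene", "water", "poly(ethylene glycol)"]
    print(post_process(candidates, exclusion_list=["Water"]))
```

In this sketch the candidate mentions would come from the entity recognition models, and the exclusion list would be built from terms that were frequently mislabelled in the training data, as the abstract indicates.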
