Journal: BMC Bioinformatics
NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

Abstract

Background: Biomedical named entity recognition (Bio-NER) is a challenging problem because biomedical named entities of the same category (e.g., proteins and genes) generally do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model, since memory resources are limited and some features are ineffective. To resolve this problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. To address this, we apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.
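
The numerical normalization described above can be illustrated with a minimal sketch. The abstract does not specify the representative numeral or whether each digit or each run of digits is replaced; the choice of "1" and of collapsing whole digit runs, as well as the function name `normalize_numerals`, are assumptions made here for illustration only.

```python
import re

# Assumption: a "numeral" is any maximal run of ASCII digits.
_DIGIT_RUN = re.compile(r"[0-9]+")


def normalize_numerals(term: str, representative: str = "1") -> str:
    """Replace every run of digits in `term` with one representative numeral.

    The representative value "1" is a hypothetical choice; the abstract
    only states that a single representative numeral is used.
    """
    return _DIGIT_RUN.sub(representative, term)


# Terms that differ only in their numeric part map to the same string,
# so they share features instead of each producing a separate sparse feature.
terms = ["IL2", "IL4", "IL12", "p53", "CD28"]
normalized = [normalize_numerals(t) for t in terms]
print(normalized)
```

Under these assumptions, IL2, IL4, and IL12 all collapse to one surface form, which is the sparseness reduction the abstract refers to.
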

Results: To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-score by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total F-score of 72.98%, which is better than that of the baseline system that uses only singleton features.

Conclusion: We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
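
To make the role of the Smith-Waterman algorithm more concrete, the following is a minimal sketch of local alignment applied to two token sequences. It is not the authors' pattern-generation procedure, which the abstract does not detail; the scoring values (`match`, `mismatch`, `gap`), the linear gap penalty, the function name, and the example sentences are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed reward for identical tokens
    mismatch: int = -1,  # assumed penalty for differing tokens
    gap: int = -1,       # assumed linear gap penalty
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells are floored at 0 so that an
    # alignment can start anywhere (this is what makes it "local").
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    pairs: List[Tuple[str, str]] = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


# Hypothetical sentences: positions where the tokens differ are candidate
# slots in a shared pattern, while matching tokens form its fixed part.
s1 = "the IL1 gene is expressed in T cells".split()
s2 = "the p1 gene was expressed in B cells".split()
best_score, alignment = smith_waterman(s1, s2)
for left, right in alignment:
    print(f"{left:>10}  {right}")
```

Because the alignment spans whole sentences rather than a fixed window, a shared structure found this way can relate tokens that lie further apart than a five-word context window allows.
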