Workshop on Biomedical Natural Language Processing

SimSem: Fast Approximate String Matching in Relation to Semantic Category Disambiguation

Abstract

In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes.
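
The feature idea described in the abstract, one value per lexical resource giving how close a textual span is to its nearest entry, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the matching algorithm or the distance measure, so the sketch assumes cosine similarity over character trigrams and uses a brute-force scan rather than a fast index. The function names and the toy resources are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a multiset of character n-grams, padding both ends with '$'."""
    padded = "$" * (n - 1) + text.lower() + "$" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram multisets (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def closest_match_features(span, resources):
    """Map each resource name to the similarity of its closest entry to `span`.

    Brute force: every entry is compared, which would not scale to the
    twenty million items mentioned in the abstract.
    """
    span_grams = char_ngrams(span)
    return {
        name: max((cosine(span_grams, char_ngrams(entry)) for entry in entries), default=0.0)
        for name, entries in resources.items()
    }


# Hypothetical toy resources, invented for illustration only.
resources = {
    "protein": ["interleukin-2", "tumor necrosis factor"],
    "disease": ["leukemia", "tumour"],
}
features = closest_match_features("interleukin 2", resources)
```

Each value in `features` could then be passed to a classifier as one numeric feature per resource, so that a spelling variant such as "interleukin 2" still scores close to the resource that lists "interleukin-2".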