JMIR Medical Informatics

Word Embedding for the French Natural Language in Health Care: Comparative Study


Abstract

Background: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made of the ability of each of the 3 most widely used unsupervised implementations (Word2Vec, GloVe, and FastText) to preserve the semantic similarities between words when trained on the same dataset.

Objective: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best-performing method will then help us develop a new semantic annotator.

Methods: Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied to each model, along with embedding visualization.

Results: Word2Vec had the highest score on 3 of the 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly with the skip-gram architecture.

Conclusions: Although this implementation best preserved semantic properties, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the conservation of morphological similarity observed with FastText. The models and test sets produced by this study will be the first made publicly available through a graphical interface to help advance French biomedical research.
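For readers unfamiliar with these evaluation tasks, the sketch below shows how skip-gram Word2Vec and FastText models could be trained and then queried for cosine similarity, odd-one-out, and analogy-based operations with the gensim library (version 4 or later). The corpus, hyperparameters, and example French clinical terms are illustrative assumptions only and do not reproduce the study's data or results; GloVe is omitted because its training relies on the separate Stanford implementation.

# Minimal sketch (not the authors' code) of the compared embeddings and the
# rated evaluation tasks, using gensim >= 4.0. The toy corpus below is a
# stand-in for the 641,279 Rouen University Hospital documents, which are
# not publicly distributable.
from gensim.models import Word2Vec, FastText

# Hypothetical pre-tokenized French clinical sentences (one token list per document).
corpus = [
    ["le", "patient", "présente", "une", "douleur", "thoracique"],
    ["prescription", "de", "paracétamol", "pour", "la", "douleur"],
    ["compte", "rendu", "opératoire", "sans", "complication"],
    ["ordonnance", "de", "sortie", "après", "hospitalisation"],
]

# Skip-gram Word2Vec (sg=1), the architecture that scored best in the study.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# FastText trained on the same corpus; its subword n-grams capture the
# morphological similarity mentioned in the conclusions.
ft = FastText(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# Rated task: cosine similarity between two word vectors.
print(w2v.wv.similarity("douleur", "paracétamol"))

# Rated task: odd one, i.e. the term least similar to the others.
print(w2v.wv.doesnt_match(["douleur", "thoracique", "ordonnance"]))

# Rated task: analogy-based operation (a - b + c ~= ?).
print(w2v.wv.most_similar(positive=["prescription", "sortie"],
                          negative=["ordonnance"], topn=3))

# Unlike Word2Vec, FastText can also build vectors for out-of-vocabulary,
# morphologically related forms (here an inflected variant).
print(ft.wv.most_similar("douleurs", topn=3))

In the study itself, these same operations were scored on the full hospital corpus and additionally submitted to human raters; the toy queries above only show the form of each task, not meaningful clinical results.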
