JMIR Medical Informatics

Word Embedding for the French Natural Language in Health Care: Comparative Study


Abstract

Background: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made of the ability of each of the 3 most widely used unsupervised implementations (Word2Vec, GloVe, and FastText) to preserve the semantic similarities between words when trained on the same dataset.

Objective: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best-performing method will then help us develop a new semantic annotator.

Methods: Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied to each model, along with embedding visualization.

Results: Word2Vec had the highest score on 3 of the 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly with the skip-gram architecture.

Conclusions: Although this implementation best preserved semantic properties, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the conservation of morphological similarity observed with FastText. The models and test sets produced by this study will be the first made publicly available through a graphical interface to help advance French biomedical research.
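For readers unfamiliar with these evaluation tasks, the sketch below shows how skip-gram Word2Vec and FastText models could be trained and then queried for cosine similarity, odd-one-out, and analogy-based operations with the gensim library (version 4 or later). The corpus, hyperparameters, and example French clinical terms are illustrative assumptions only and do not reproduce the study's data or results; GloVe is omitted because its training relies on the separate Stanford implementation.

# Minimal sketch (not the authors' code) of the compared embeddings and the
# rated evaluation tasks, using gensim >= 4.0. The toy corpus below is a
# stand-in for the 641,279 Rouen University Hospital documents, which are
# not publicly distributable.
from gensim.models import Word2Vec, FastText

# Hypothetical pre-tokenized French clinical sentences (one token list per document).
corpus = [
    ["le", "patient", "présente", "une", "douleur", "thoracique"],
    ["prescription", "de", "paracétamol", "pour", "la", "douleur"],
    ["compte", "rendu", "opératoire", "sans", "complication"],
    ["ordonnance", "de", "sortie", "après", "hospitalisation"],
]

# Skip-gram Word2Vec (sg=1), the architecture that scored best in the study.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# FastText trained on the same corpus; its subword n-grams capture the
# morphological similarity mentioned in the conclusions.
ft = FastText(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# Rated task: cosine similarity between two word vectors.
print(w2v.wv.similarity("douleur", "paracétamol"))

# Rated task: odd one, i.e. the term least similar to the others.
print(w2v.wv.doesnt_match(["douleur", "thoracique", "ordonnance"]))

# Rated task: analogy-based operation (a - b + c ~= ?).
print(w2v.wv.most_similar(positive=["prescription", "sortie"],
                          negative=["ordonnance"], topn=3))

# Unlike Word2Vec, FastText can also build vectors for out-of-vocabulary,
# morphologically related forms (here an inflected variant).
print(ft.wv.most_similar("douleurs", topn=3))

In the study itself, these same operations were scored on the full hospital corpus and additionally submitted to human raters; the toy queries above only show the form of each task, not meaningful clinical results.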
