Journal of Korean Medical Science

Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach



Abstract

Background: Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because the results are typically stored as free text or PDFs in electronic medical records. Here we propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information.

Methods: We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME) from 2000 to 2018 as a rule-development data set (64,024 reports) and from the most recent year (6,181 reports) as a test set. We used a rule-based natural language processing approach, built on Python regular expressions, to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) precise HLA genotypes. The performance of the rules and code was evaluated by comparing the results extracted from the test set with a validation set generated by manual curation.

Results: From the 11,287 reports in the development set and the 1,107 reports in the test set that described HLA typing for a single patient, iterative rule generation yielded 124 extraction rules and 8 cleaning rules for HLA genotypes. Applying these rules extracted HLA genotypes with 0.892–0.999 precision and 0.795–0.998 recall across the five HLA genes. The precision and recall of the extraction rules for the number of patients in a report were 0.997 and 0.994, and those for clinical variable extraction were 0.997 and 0.992, respectively. All extracted HLA alleles and serotypes were transformed by the cleaning rules to conform to formal HLA nomenclature.

Conclusion: The rule-based HLA genotype extraction method shows reliable accuracy. We believe that a significant number of patients would benefit if this under-used genetic information were returned to them.
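To illustrate the kind of regex-based extraction, cleaning, and evaluation the abstract describes, a minimal Python sketch might look like the following. The pattern, the choice of genes (A, B, C, DRB1, DQB1), and the function names `extract_alleles` and `precision_recall` are assumptions made for this example; they are not the study's 124 extraction rules or 8 cleaning rules, which the abstract does not list.

```python
import re

# Hypothetical rule: an optional "HLA-" prefix, a gene name, "*", and one to
# four colon-separated numeric fields, tolerating stray whitespace.
# The gene list is an assumption; the abstract does not name the five genes.
ALLELE_RE = re.compile(
    r"\b(?:HLA-)?(A|B|C|DRB1|DQB1)\s*\*\s*"
    r"(\d{2,3}(?:\s*:\s*\d{2,3}){0,3})(?!\d)",
    re.IGNORECASE,
)


def extract_alleles(report_text):
    """Return alleles found in free text, normalized to 'HLA-GENE*ff:ff'."""
    alleles = []
    for gene, fields in ALLELE_RE.findall(report_text):
        # Cleaning step: drop internal whitespace and upper-case the gene.
        cleaned_fields = re.sub(r"\s+", "", fields)
        alleles.append(f"HLA-{gene.upper()}*{cleaned_fields}")
    return alleles


def precision_recall(extracted, curated):
    """Compare extracted alleles with a manually curated set."""
    extracted_set, curated_set = set(extracted), set(curated)
    true_positives = len(extracted_set & curated_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(curated_set) if curated_set else 0.0
    return precision, recall


# Invented example report text, not taken from the study data.
sample = "HLA-A*02:01, A*24:02 / B * 15:01 ; DRB1*04 : 05"
found = extract_alleles(sample)
print(found)
print(precision_recall(found, found + ["HLA-DQB1*03:02"]))
```

This sketch handles only colon-delimited allele names. Serotypes, legacy undelimited forms such as `A*0201`, and reports covering more than one patient would each need additional rules, which is consistent with the large rule count reported above.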
