Ontology-based Semantic Harmonization of HIV-associated Common Data Elements for Integration of Diverse HIV Research Datasets.

Abstract

Our aims were to: 1) characterize the semantic heterogeneity (SH) in the Human Immunodeficiency Virus (HIV) research domain; 2) identify HIV-associated common data elements (CDEs) in empirically generated and knowledge-based resources; 3) create a formal representation of HIV-associated CDEs in the form of an HIV-associated Entities in Research Ontology (HERO); and 4) assess the feasibility of using HERO to semantically harmonize HIV research data. Our approach was guided by information/knowledge theory and the DIKW (Data Information Knowledge Wisdom) hierarchical model.

Our systematized review of the literature revealed that synergistic uses of ontologies and CDEs included integration, interoperability, data exchange, and data standardization. Methods and tools included use of experts for CDE identification, the Unified Medical Language System, natural language processing, Extensible Markup Language, Health Level 7, and ontology development tools (e.g., Protege). Evaluation methods included expert assessment, quantification of mapping tasks between raters, assessment of interrater reliability, and comparison to established standards. We used these findings to inform our process for achieving the study aims.

For Aim 1, we analyzed eight disparate HIV-associated data dictionaries and developed a String Metric-assisted Assessment of Semantic Heterogeneity (SMASH) method, which aided identification of 127 (13%) homogeneous data element (DE) pairs and 1,048 (87%) semantically heterogeneous DE pairs. Most heterogeneous pairs (97%) were semantically equivalent but syntactically different, allowing us to determine that SH in the HIV research domain was high.

To achieve Aim 2, we used ClinicalTrials.gov, Google Search, and text mining in R to identify HIV-associated CDEs in HIV journal articles, HIV-associated datasets, the AIDSinfo HIV/AIDS Glossary, the AIDSinfo Drug Database, Logical Observation Identifiers Names and Codes (LOINC), Systematized Nomenclature of Medicine (SNOMED), and RxNorm (understood as prescription normalization). Two HIV experts then manually reviewed DEs from the journal articles and data dictionaries to confirm DE commonality and resolved semantic discrepancies through discussion. Ultimately, we identified 2,179 unique CDEs. Of all CDEs, data-driven approaches identified 2,055 (94%): 999 from the HIV/AIDS Glossary, 398 from the Drug Database, 91 from journal articles, and a total of 567 from LOINC, SNOMED, and RxNorm combined. Expert-based approaches identified 124 (6%) unique CDEs from data dictionaries and confirmed the 91 CDEs from journal articles.

In Aim 3, we used the Protege suite of ontology development tools and the 2,179 CDEs to develop HERO. We modeled the ontology using the semantic structure of the Medical Entities Dictionary, available hierarchical information from the CDE knowledge resources, and expert knowledge. The ontology fulfilled most relevant criteria from Cimino's desiderata and OntoClean ontology engineering principles, and it successfully answered eight competency questions.

Finally, for Aim 4, we assessed the feasibility of using HERO to semantically harmonize and integrate the data dictionaries from two diverse HIV-associated datasets. Two HIV experts involved in the development of HERO independently assessed each data dictionary. Of the 367 DEs in data dictionary 1 (D1), 181 (49.32%) were identified as CDEs and 186 (50.68%) were not CDEs; of the 72 DEs in data dictionary 2 (D2), 37 (51.39%) were CDEs and 35 (48.61%) were not CDEs. The HIV experts then traversed HERO's hierarchy to map CDEs from D1 and D2 to CDEs in HERO. Of the 181 CDEs in D1, 156 (86.19%) were found in HERO, and 25 (13.81%) were not. Similarly, of the 37 CDEs in D2, 32 (86.48%) were found in HERO, and 5 (13.51%) were not. Interrater reliability for CDE identification, as measured by Cohen's kappa, was 0.900 for D1 and 0.892 for D2. Cohen's kappas for CDEs in D1 and D2 that were also identified in HERO were 0.885 and 0.688, respectively.

Subsequently, to demonstrate the integration of the two HIV-associated datasets, a sample of semantically harmonized CDEs present in both datasets was selected by category (e.g., administrative, demographic, and behavioral), and D2 sample size increases were calculated from the integrated datasets for race (e.g., White, African American/Black, Asian/Pacific Islander, Native American/Indian, and Hispanic/Latino) and for "intravenous drug use". The average increase of D2 CDEs for six selected CDEs was 1,928%.

Despite the limitation of HERO developers also serving as evaluators, the contributions of the study to the fields of informatics and HIV research were substantial. Confirmatory contributions include identification of effective CDE/ontology tools and use of data-driven and expert-based methods. Novel contributions include the development of SMASH and HERO. New contributions include documenting that SH is high in HIV-associated datasets, identifying 2,179 HIV-associated CDEs, creating two additional classifications of SH, and showing that using HERO for semantic harmonization of HIV-associated data dictionaries is feasible. Our future work will build upon this research by expanding the numbers and types of datasets, refining our methods and tools, and conducting an external evaluation.

(Abstract shortened by ProQuest.)
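The abstract does not spell out how SMASH works, so the following is only a minimal sketch of the general idea of using a string metric to flag candidate data element pairs across two data dictionaries. It is not the SMASH method itself. The similarity measure (`difflib.SequenceMatcher`), the 0.6 threshold, the helper names, and the data element names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def normalize(name: str) -> str:
    """Lowercase a data element name and collapse '_' and '-' into single spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names (stand-in string metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_pairs(dict_a, dict_b, threshold=0.6):
    """Return (name_a, name_b, score) for every cross-dictionary pair whose
    score meets the threshold, highest score first. The threshold is an
    illustrative assumption, not a value from the study."""
    pairs = []
    for a, b in product(dict_a, dict_b):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical data element names, not taken from the study's dictionaries.
dictionary_1 = ["birth_date", "hiv_status", "race"]
dictionary_2 = ["Birth Date", "HIV serostatus", "ethnic_group"]

for name_a, name_b, score in candidate_pairs(dictionary_1, dictionary_2):
    print(f"{name_a!r} ~ {name_b!r}: {score}")
```

A string metric of this kind can only surface candidates. Pairs that mean the same thing but share few characters (here, possibly `race` and `ethnic_group`) would still need expert review, which is consistent with the abstract's finding that most heterogeneous pairs were semantically equivalent but syntactically different.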
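The abstract reports interrater reliability as Cohen's kappa, which compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that standard formula in plain Python. The function name and the ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # p_o: proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement computed from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten data elements: is each one a CDE?
expert_1 = ["CDE", "CDE", "CDE", "not", "not", "not", "CDE", "not", "CDE", "not"]
expert_2 = ["CDE", "CDE", "not", "not", "not", "not", "CDE", "not", "CDE", "CDE"]

print(round(cohens_kappa(expert_1, expert_2), 3))
```

In this made-up example the raters agree on 8 of 10 items (p_o = 0.8), and each labels half the items as CDEs (p_e = 0.5), so kappa is about 0.6.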

Bibliographic record

  • Author: Brown, William, III.
  • Author affiliation: Columbia University.
  • Degree-granting institution: Columbia University.
  • Subject: Information science; Information technology; Health sciences.
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 201 p.
  • Original format: PDF
  • Language of text: English
