Source: AMIA Summits on Translational Science Proceedings

Extracting Country-of-Origin from Electronic Health Records for Gene-Environment Studies as Part of the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) Study

Abstract

We describe here the extraction of country-of-origin, an acculturation variable relevant for gene-environment studies, in a biorepository linked to de-identified electronic health records (EHRs) assessed by the Epidemiologic Architecture for Genes Linked to Environment (EAGLE), a study site of the Population Architecture using Genomics and Epidemiology (PAGE) I study. We extracted country-of-origin from the unstructured clinical free text using regular expressions within the MySQL relational database system in a cohort of 15,863 subjects of mostly non-European descent (including 11,519 African Americans, 1,702 Hispanics, and 1,118 Asians). We performed searches for 231 world countries (including independent sovereign states, dependent areas, and disputed territories) and common misspellings in >14 gigabytes of data including >13 billion characters of clinical text. Manual review of a fraction of the initial country-of-origin assignments established rules for data cleaning and quality control to achieve final country-of-origin status for each subject. After data cleaning, a total of 1,911/15,893 (12.02%) subjects were assigned to a country-of-origin outside of the United States. Mexico was the most commonly assigned country outside of the United States (264 subjects; 13.8% of subjects with a foreign country-of-origin assignment). The distribution of the countries assigned followed expectations based on known migration patterns to the United States with an emphasis on the southeastern region. These data suggest country-of-origin can be successfully extracted from unstructured clinical text for downstream genetic association studies.
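The search step described in the abstract can be illustrated with a minimal sketch. The study ran regular expressions inside MySQL; the version below swaps in Python's `re` module purely for illustration. The variant table, the function names, and the sample notes are all hypothetical and are not taken from the study.

```python
import re
from collections import Counter

# Hypothetical lookup table: canonical country name -> spelling variants.
# The study searched 231 countries plus common misspellings; these few
# entries are illustrative assumptions only.
COUNTRY_VARIANTS = {
    "Mexico": ["mexico", "mexcio"],
    "Philippines": ["philippines", "phillipines", "philipines"],
    "Ethiopia": ["ethiopia", "ethopia"],
}


def compile_patterns(variants):
    """Build one case-insensitive, word-bounded pattern per country."""
    patterns = {}
    for country, spellings in variants.items():
        alternation = "|".join(re.escape(s) for s in spellings)
        patterns[country] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def candidate_countries(notes_by_subject, patterns):
    """Count country mentions per subject across all of that subject's notes.

    These are only initial candidates; they are not final assignments.
    """
    candidates = {}
    for subject_id, notes in notes_by_subject.items():
        counts = Counter()
        for note in notes:
            for country, pattern in patterns.items():
                hits = len(pattern.findall(note))
                if hits:
                    counts[country] += hits
        candidates[subject_id] = counts
    return candidates


# Made-up notes for two hypothetical subjects.
notes = {
    "S1": ["Patient was born in Mexcio and moved here in 1990.",
           "Family lives in MEXICO."],
    "S2": ["No travel history."],
}
result = candidate_countries(notes, compile_patterns(COUNTRY_VARIANTS))
```

A pure string match like this over-assigns: a note mentioning "New Mexico" would count toward Mexico, for example. False positives of that kind are presumably what the manual review and data-cleaning rules described above were meant to catch.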
