Journal: Genome Research

Ancestry-agnostic estimation of DNA sample contamination from sequence reads


Abstract

Detecting and estimating DNA sample contamination are important steps to ensure high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information for accurate estimation of contamination rates. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical or even impossible for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. On the other hand, incorrectly specified allele frequencies may result in substantial bias in estimated contamination rates. For example, we observed that existing methods often fail to identify samples with 10% contamination at a typical 3% contamination exclusion threshold when genetic ancestry is misspecified. Such incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors, even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to the genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used for estimating genetic ancestries, similar to LASER or TRACE, while simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, while our method corrects for such biases.
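
To make the likelihood idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a standard two-sample mixture model in which each read comes from the contaminating sample with probability alpha, Hardy-Weinberg genotype priors, a single per-read error rate, and a linear model that maps principal component coordinates to an individual-specific allele frequency. All function names, parameters, and the grid search are hypothetical and chosen only for illustration. The abstract describes estimating ancestry and contamination jointly, whereas this sketch takes the allele frequencies as given and estimates only alpha.

```python
import math


def individual_af(mean_af, loadings, pcs, eps=1e-3):
    """Assumed linear model: AF = mean + sum_k loading_k * pc_k, clipped to (eps, 1 - eps)."""
    shift = sum(w * x for w, x in zip(loadings, pcs))
    return min(max(mean_af + shift, eps), 1.0 - eps)


def genotype_priors(af):
    """Hardy-Weinberg priors for 0, 1, or 2 copies of the alternate allele."""
    return [(1 - af) ** 2, 2 * af * (1 - af), af ** 2]


def p_alt_read(genotype, err):
    """Probability that a single read shows the alternate allele, given the genotype."""
    frac = genotype / 2.0
    return frac * (1 - err) + (1 - frac) * err


def site_log_lik(n_ref, n_alt, af_intended, af_contam, alpha, err=0.01):
    """Log-likelihood of the read counts at one site, summed over both samples' genotypes."""
    prior1 = genotype_priors(af_intended)
    prior2 = genotype_priors(af_contam)
    total = 0.0
    for g1 in range(3):
        for g2 in range(3):
            # Each read comes from the contaminant with probability alpha.
            p_alt = (1 - alpha) * p_alt_read(g1, err) + alpha * p_alt_read(g2, err)
            total += prior1[g1] * prior2[g2] * p_alt ** n_alt * (1 - p_alt) ** n_ref
    return math.log(total)


def estimate_alpha(sites, err=0.01, grid=51):
    """Grid search over alpha in [0, 0.5]; sites holds (n_ref, n_alt, af_intended, af_contam)."""
    candidates = [i * 0.5 / (grid - 1) for i in range(grid)]
    return max(
        candidates,
        key=lambda a: sum(site_log_lik(r, v, p1, p2, a, err) for r, v, p1, p2 in sites),
    )
```

In this sketch, misspecified ancestry corresponds to passing allele frequencies that do not match the sample, which shifts the genotype priors and can bias the estimated alpha. A joint approach of the kind the abstract describes would also optimize the principal component coordinates passed to `individual_af` instead of fixing them.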
