BMC Genomics

Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree


Abstract

Background: Two main types of approaches are used to estimate repeat numbers in short tandem repeat (STR) regions from high-throughput sequencing data: approaches that directly count the repeat patterns in sequence reads spanning the region, and approaches that detect the difference between the insert size inferred from aligned paired-end reads and the actual insert size. The former approaches estimate repeat numbers with high accuracy, but the size of the target STR region is limited to the length of the sequence reads. The latter approaches can handle STR regions longer than the sequence reads, but the repeat numbers they estimate are less accurate than those from the former approaches.

Results: We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by treating coalescent trees as hidden variables, and the summation over the hidden variables is taken over coalescent trees sampled with Markov chain Monte Carlo based on phased genotypes located around the target STR region. Repeat number information from insert size data is propagated along the sampled coalescent trees, so more accurate estimation of repeat numbers is expected for STR regions longer than the sequence reads. To find the repeat numbers that maximize the likelihood of the model, we proposed a state-of-the-art belief propagation algorithm on the sampled coalescent trees.

Conclusions: We verified the effectiveness of the proposed approach by comparing it with existing methods on simulation datasets and on real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.
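The insert-size idea described in the Background can be illustrated with a minimal sketch. This is not the coalescentSTR model: it only shows the single-individual intuition that a repeat expansion makes read pairs map closer together on the reference than the library's actual insert size, and a contraction makes them map farther apart. The function name, its parameters, and the numbers are hypothetical, and the sketch assumes a single haploid allele and a known mean library insert size.

```python
from statistics import mean


def estimate_repeat_number(observed_distances, library_insert_mean,
                           ref_repeat_number, unit_length):
    """Naive repeat number estimate from paired-end mapped distances.

    observed_distances: distances (bp) between mates as mapped to the
        reference, for read pairs spanning the STR region.
    library_insert_mean: mean actual insert size (bp) of the library.
    ref_repeat_number: number of repeat units in the reference.
    unit_length: length (bp) of one repeat unit.
    """
    if not observed_distances:
        raise ValueError("need at least one read pair spanning the STR region")
    # Bases present in the sample but absent from the reference shrink the
    # mapped distance, so a positive shift indicates an expansion.
    length_shift = library_insert_mean - mean(observed_distances)
    # Convert the shift from base pairs to repeat units.
    repeat_shift = length_shift / unit_length
    # A repeat number cannot be negative.
    return max(0, round(ref_repeat_number + repeat_shift))


# Hypothetical example: 3 bp repeat unit, 10 units in the reference,
# library mean insert size of 300 bp.
distances = [288, 291, 286, 290, 285]
print(estimate_repeat_number(distances, 300.0, 10, 3))
```

With only a few read pairs, the variance of the insert size distribution dominates this estimate, which is consistent with the abstract's remark that insert-size-based approaches are less accurate than direct counting.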
