Conference: Asia-Pacific Bioinformatics Conference

Repeat-aware modeling and correction of short read errors



Abstract

Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous k-mer may be frequently observed if it has few nucleotide differences with valid k-mers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content.
Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold useful for validating k-mers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.
Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".
Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
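
The conventional baseline mentioned in the Background (count k-mers in the reads, then accept those whose observed frequency exceeds a threshold) can be illustrated with a minimal Python sketch. This is not the authors' repeat-aware method or their C++ software; the function names `count_kmers` and `split_by_threshold`, the toy reads, and the values of `k` and `threshold` are all assumptions chosen for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Accept k-mers whose observed count exceeds the threshold; flag the rest."""
    valid = {kmer for kmer, c in counts.items() if c > threshold}
    suspect = set(counts) - valid
    return valid, suspect


# Hypothetical toy data: three identical reads and one read with a substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
valid, suspect = split_by_threshold(counts, threshold=1)
```

On this toy input the three k-mers from the repeated read are accepted and the three k-mers introduced by the substitution are flagged. As the abstract notes, this plain thresholding can fail on repeat-rich genomes, where an erroneous k-mer that differs by a few nucleotides from a highly repeated valid k-mer may itself be observed often enough to pass.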
