Journal: Genome Research

Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing.



Abstract

Novel massively parallel sequencing technologies provide highly detailed structures of transcriptomes and genomes by yielding deep coverage of short reads, but their utility is limited by inadequate sequencing quality and short read lengths. Sequencing-error trimming in short reads is therefore a vital process that could improve the rate of successful reference mapping and polymorphism detection. Toward this aim, we herein report a frequency-based, de novo short-read clustering method that organizes erroneous short sequences originating in a single abundant sequence into a tree structure; in this structure, each "child" sequence is considered to be stochastically derived from its more abundant "parent" sequence with one mutation through sequencing errors. The root node is the most frequently observed sequence and represents all erroneous reads in the entire tree, allowing the reliable representative read to be aligned to the genome without the risk of mapping erroneous reads to false-positive positions. This method complements base calling and error correction that relies on direct alignment with the reference genome, and is able to improve the overall accuracy of short-read alignment by consulting the inherent relationships among the entire set of reads. The algorithm runs efficiently with linear time complexity. In addition, an error rate evaluation model can be derived from bacterial artificial chromosome sequencing data obtained in the same run as a control. In two clustering experiments using small RNA and 5'-end mRNA read data sets, we confirmed a remarkable increase (approximately 5%) in the percentage of short reads aligned to the reference sequence.
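The parent/child tree described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on assumptions the abstract does not state: "one mutation" is taken to mean a single base substitution (no insertions or deletions), a parent only needs to be strictly more abundant than its child (no abundance-ratio or error-model threshold is applied), and all function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def one_substitution_neighbors(seq):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, base in enumerate(seq):
        for alt in ALPHABET:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def cluster_reads(reads):
    """Link each distinct read to its most abundant one-substitution neighbor.

    A neighbor qualifies as a parent only if it is strictly more abundant,
    so counts increase along every path and no cycle can form.
    Returns (counts, parent); parent[seq] is None for a root.
    """
    counts = Counter(reads)
    parent = {}
    for seq, n in counts.items():
        best = None
        for neighbor in one_substitution_neighbors(seq):
            m = counts.get(neighbor, 0)
            if m > n and (best is None or m > counts[best]):
                best = neighbor
        parent[seq] = best
    return counts, parent


def find_root(seq, parent):
    """Follow parent links until reaching the root of the tree."""
    while parent[seq] is not None:
        seq = parent[seq]
    return seq


def representative_counts(reads):
    """Fold every read's count into the root sequence of its tree."""
    counts, parent = cluster_reads(reads)
    totals = Counter()
    for seq, n in counts.items():
        totals[find_root(seq, parent)] += n
    return dict(totals)
```

For example, given ten copies of ACGT, two of ACGA, one of ACTA, and three of TTTT, this sketch attaches ACTA to ACGA and ACGA to ACGT, so ACGT represents 13 reads while TTTT remains its own root with 3. Because each distinct read is compared only against its hash-indexed one-substitution neighbors, the work in this sketch grows roughly linearly with the number of distinct reads for a fixed read length; whether the published algorithm reaches linear time the same way is not stated in the abstract.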
