Conference: IEEE International Conference on Bioinformatics and Biomedicine
de novo repeat detection based on the third generation sequencing reads

Abstract

Repetitive sequences are fragments that appear at more than one location in a genome. Numerous studies have shown that repetitive sequences play indispensable roles in the evolution, inheritance, variation, gene expression, transcriptional regulation, chromosome construction, and physiological metabolism of organisms. In many sequence and genome analyses, such as read alignment, de novo assembly, and genome annotation, repetitive sequences can pose major challenges. Detecting and classifying repeats is therefore one of the main steps of genome sequence analysis in bioinformatics. However, most existing de novo detection methods struggle to mark repetitive regions satisfactorily in terms of both size and accuracy, because next-generation sequencing (NGS) reads are too short to identify long repeats and raw single-molecule sequencing (SMS) long reads have high error rates. In this study, we present a new de novo repeat detection method called DLR (Detection of Long Repeats), based on PacBio long reads. DLR first converts all long reads into unique k-mers of a certain length and retains those that occur with high frequency. Then, these high-frequency k-mers are aligned to the long reads using multiple sequence alignment, and the regions of the long reads covered by them are recorded as high-frequency regions. Finally, recorded high-frequency regions that have inclusion relations are merged to obtain the final repetitive sequences. The experimental results show that DLR achieves better results than other existing algorithms in terms of effective size and accuracy.
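
The three steps described in the abstract (k-mer counting and filtering, marking covered regions on reads, merging regions by inclusion) can be illustrated with a toy sketch. This is not the authors' DLR implementation. It assumes exact k-mer matching in place of the multiple sequence alignment the abstract mentions, so it would not tolerate the error rates of real long reads. It also interprets "inclusion relations" as substring containment. The function names and the parameters `k` and `min_count` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Keep only k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def covered_regions(read, kmers, k):
    """Return half-open intervals (start, end) of the read covered by
    high-frequency k-mers. Exact matching is an assumption of this sketch;
    overlapping or adjacent hits are joined into one region."""
    regions = []
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + k      # extend the current region
            else:
                regions.append([i, i + k])  # start a new region
    return [(start, end) for start, end in regions]


def merge_contained(sequences):
    """Drop duplicates and any sequence fully contained in a longer one."""
    kept = []
    for seq in sorted(set(sequences), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


def detect_repeats(reads, k=4, min_count=3):
    """Toy pipeline: filter k-mers, collect covered regions, merge them."""
    kmers = high_frequency_kmers(count_kmers(reads, k), min_count)
    candidates = [
        read[start:end]
        for read in reads
        for start, end in covered_regions(read, kmers, k)
    ]
    return merge_contained(candidates)


if __name__ == "__main__":
    # Three short made-up reads that share the fragment GATTACA.
    reads = ["CCAGATTACATG", "TGGATTACACC", "AAGATTACAGT"]
    print(detect_repeats(reads, k=4, min_count=3))  # → ['GATTACA']
```

In this toy input, only the four k-mers inside the shared fragment reach the threshold of three occurrences, so the flanking bases are excluded from the reported region. On real data, the choice of k-mer length and frequency cutoff would depend on sequencing depth and error rate, which the abstract does not specify.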
