Journal: BMC Bioinformatics

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Abstract

Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has created a need for methods more efficient than general-purpose compression tools such as the widely used gzip. We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph by storing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses. Leon was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, Leon showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, Leon divided the original file size by more than 20. Leon is open source software, distributed under the GNU Affero GPL license, and is available for download at http://gatb.inria.fr/software/leon/ .
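
The encoding idea described in the abstract can be illustrated with a minimal Python sketch. It is not Leon's implementation. The names `BloomFilter`, `build_graph`, `successors`, `encode_read` and `decode_read` are hypothetical, and the filter size and hash count are arbitrary. The sketch also makes simplifying assumptions that the abstract does not state: every k-mer of every read is inserted (no abundance filtering), the anchor is always the first k-mer of the read, reads contain only A, C, G and T, and the read length is stored alongside the anchor.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (sizes are illustrative only)."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(item))


def build_graph(reads, k):
    """Insert every k-mer of every read; the filter acts as the node set."""
    graph = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            graph.add(read[i:i + k])
    return graph


def successors(seq, k, graph):
    """Bases that extend the last (k-1)-mer of seq into a k-mer present in the graph."""
    suffix = seq[len(seq) - k + 1:]
    return [b for b in "ACGT" if suffix + b in graph]


def encode_read(read, k, graph):
    """Return (anchor k-mer, read length, bases chosen at ambiguous positions)."""
    bifurcations = []
    for i in range(k, len(read)):
        # A base is stored only when the graph does not offer exactly one extension.
        if len(successors(read[:i], k, graph)) != 1:
            bifurcations.append(read[i])
    return read[:k], len(read), bifurcations


def decode_read(anchor, length, bifurcations, graph):
    """Walk the graph from the anchor, consuming a stored base at each ambiguity."""
    k = len(anchor)
    seq = anchor
    choices = iter(bifurcations)
    while len(seq) < length:
        nxt = successors(seq, k, graph)
        seq += nxt[0] if len(nxt) == 1 else next(choices)
    return seq


reads = ["ACGTTGCATGCCA", "ACGTTGCATGACA"]
graph = build_graph(reads, k=5)
for read in reads:
    code = encode_read(read, 5, graph)
    assert decode_read(*code, graph) == read
```

Because the encoder and decoder query the same filter, a Bloom-filter false positive only adds an extra stored base and never breaks the round trip. That property is what makes a probabilistic graph usable for lossless read encoding in this sketch.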
