
Genome Assembly on a Multicore System


Abstract

The genome assembly problem is to reconstruct the original DNA sequence of an organism from a large set of short (400bp-500bp) overlapping fragments. Assembly is particularly challenging in the presence of repeats, which are multiple identical or nearly identical stretches of DNA. MIRA is an open-source assembler that is widely used by biologists and works effectively in the presence of repeats. However, it is computationally intensive: for example, an assembly of one million fragments requires about 18.3 hours. The computation in the MIRA assembler is dominated by the contig-building phase, which is highly sequential in nature. In this paper, we propose a modification to the MIRA assembler that allows this computation to be parallelized while maintaining the quality of the assembly. We implemented the modified MIRA assembler on a 64-core system with eight Intel(R) Xeon(R) X7560 processors. We were able to speed up the contig-building phase by a factor of 55 on the 64-core system. Additionally, we parallelized the other phases of the MIRA assembler and were able to reduce the total execution time of the assembly from 18.3 hours (sequential) to 3.4 hours (speedup of 5.57) without sacrificing assembly quality. It is worth noting that the overall speedup is limited by Amdahl's Law, as parts of the original MIRA assembler are inherently sequential. For example, for one million reads, the sequential portion of the MIRA assembler spends about 2.78 hours on I/O and other operations, which limits the overall speedup to 6.58.
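The Amdahl's Law bound quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the two timing figures (18.3 hours total, 2.78 hours sequential) are taken from the abstract, and the 64-worker figure assumes perfectly linear scaling of the non-sequential part, which is an idealization rather than a measured result.

```python
def amdahl_limit(total_hours: float, serial_hours: float) -> float:
    """Upper bound on speedup when the serial part cannot be shortened."""
    return total_hours / serial_hours


def amdahl_speedup(total_hours: float, serial_hours: float, workers: int) -> float:
    """Speedup if the non-serial part scaled perfectly across `workers`.

    Assumption: ideal linear scaling with no synchronization overhead.
    """
    parallel_hours = total_hours - serial_hours
    return total_hours / (serial_hours + parallel_hours / workers)


# Figures reported in the abstract for one million reads.
TOTAL_HOURS = 18.3   # sequential run time of the original assembler
SERIAL_HOURS = 2.78  # inherently sequential portion (I/O and other operations)

limit = amdahl_limit(TOTAL_HOURS, SERIAL_HOURS)
ideal_64 = amdahl_speedup(TOTAL_HOURS, SERIAL_HOURS, 64)

print(f"upper bound on speedup: {limit:.2f}")  # 6.58, matching the abstract
print(f"idealized 64-worker speedup: {ideal_64:.2f}")
```

Because the sequential 2.78 hours stays fixed, adding more cores can only move the total run time toward that floor, so the overall speedup stays below 18.3 / 2.78 regardless of how well the contig-building phase scales.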
