Conference: IEEE International Parallel and Distributed Processing Symposium

WarpDrive: Massively Parallel Hashing on Multi-GPU Nodes



Abstract

Hash maps are among the most versatile data structures in computer science because of their compact data layout and expected constant time complexity for insertion and querying. However, the associated memory access patterns during the probing phase are highly irregular, resulting in strongly memory-bound implementations. Massively parallel accelerators such as CUDA-enabled GPUs may overcome this limitation by virtue of their fast video memory, which features almost 1 TB/s of bandwidth compared with less than 100 GB/s for the main memory modules of state-of-the-art CPUs. Unfortunately, the size of hash maps supported by existing single-GPU hashing implementations is restricted by the limited amount of available video RAM. Hence, hash map construction and querying that scale across multiple GPUs are urgently needed in order to support structured storage of bigger datasets at high speeds. In this paper, we introduce WarpDrive, a scalable, distributed single-node multi-GPU implementation for the construction and querying of billions of key-value pairs. We propose a novel subwarp-based probing scheme featuring coalesced memory access over consecutive memory regions in order to mitigate the high latency of irregular access patterns. Our implementation achieves 1.4 billion insertions per second in single-GPU mode for a load factor of 0.95, thereby outperforming the GPU-cuckoo implementation of the CUDPP library by a factor of 2.8 on a P100. Furthermore, we present transparent scaling to multiple GPUs within the same node with up to 4.3 billion operations per second for high load factors on four P100 GPUs connected by NVLink technology. WarpDrive is free software and can be downloaded at https://github.com/sleeepyjack/warpdrive.
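
The probing idea in the abstract, a small group of threads inspecting a consecutive memory region together, can be illustrated with a sequential sketch. The Python below is not the WarpDrive implementation: the class name `GroupProbedHashMap`, the `group_size` parameter, and the linear order in which windows are visited are all assumptions made for illustration. Each "window" of `group_size` adjacent slots stands in for the region that one subwarp would load with a single coalesced access on a GPU; here the lanes are simply simulated by a loop.

```python
EMPTY = object()  # sentinel marking an unused slot (hypothetical helper)


class GroupProbedHashMap:
    """Open-addressing hash map that probes whole windows of adjacent slots.

    Illustrative only: on a GPU, each of the `group_size` lanes of a thread
    group would read one slot of the window in parallel.
    """

    def __init__(self, capacity, group_size=4):
        if capacity <= 0 or group_size <= 0 or capacity % group_size != 0:
            raise ValueError("capacity must be a positive multiple of group_size")
        self.capacity = capacity
        self.group_size = group_size
        self.keys = [EMPTY] * capacity
        self.values = [EMPTY] * capacity

    def _window_starts(self, key):
        """Yield the start index of every window, beginning at the key's home window."""
        num_windows = self.capacity // self.group_size
        home = hash(key) % num_windows
        for step in range(num_windows):
            # Assumption: windows are visited linearly with wrap-around.
            yield ((home + step) % num_windows) * self.group_size

    def insert(self, key, value):
        """Insert or update a pair; return False if the table is full."""
        for start in self._window_starts(key):
            window = range(start, start + self.group_size)
            # First pass over the window: update the key if it is already present.
            for slot in window:
                if self.keys[slot] == key:
                    self.values[slot] = value
                    return True
            # Second pass: claim the first empty slot of the window.
            for slot in window:
                if self.keys[slot] is EMPTY:
                    self.keys[slot] = key
                    self.values[slot] = value
                    return True
        return False

    def get(self, key, default=None):
        """Return the value stored for key, or default if the key is absent."""
        for start in self._window_starts(key):
            window_keys = self.keys[start:start + self.group_size]
            for offset, stored in enumerate(window_keys):
                if stored == key:
                    return self.values[start + offset]
            # Without deletions, an empty slot means the key cannot lie further on.
            if any(stored is EMPTY for stored in window_keys):
                return default
        return default
```

A short usage example: `m = GroupProbedHashMap(capacity=16, group_size=4)` followed by `m.insert(7, "x")` and `m.get(7)`. The sketch does not support deletion, which keeps the early exit in `get` valid; a real GPU implementation would additionally need atomic operations to claim slots safely across concurrent groups.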
