Conference: ICA3PP 2014

C2CU: A CUDA C Program Generator for Bulk Execution of a Sequential Algorithm

Abstract

A sequential algorithm is oblivious if the address it accesses at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. The bulk execution of a sequential algorithm executes it for many independent inputs, either in turn or in parallel. The main contribution of this paper is a tool that generates a CUDA C program for the bulk execution of an oblivious sequential algorithm. More specifically, our tool automatically converts a C program describing an oblivious sequential algorithm into a CUDA C program that performs the bulk execution of that C program. The generated CUDA C programs can be executed on CUDA-enabled GPUs. We have implemented CUDA C programs for the bulk execution of the bitonic sorting algorithm, the Floyd-Warshall algorithm, and Montgomery modular multiplication. For the bulk execution, our implementations running on a GeForce GTX Titan can be 199 times faster for bitonic sort, 54 times faster for the Floyd-Warshall algorithm, and 78 times faster for Montgomery modular multiplication than the implementations on a single Intel Xeon CPU.
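To make the two definitions concrete, here is a minimal sketch in plain C of bulk execution for one oblivious algorithm, bitonic sort. It is not output of the C2CU tool, and the function name `bitonic_sort_bulk` and the interleaved layout (element i of input j stored at `b[i * p + j]`) are assumptions made for illustration. The point it shows is that the indices `i` and `l` are computed from loop counters only, never from the data, so the same address sequence serves all p inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch, not generated by C2CU.
 * Sorts p independent inputs of n floats each into ascending order.
 * Assumed layout: element i of input j lives at b[i * p + j].
 * n must be a power of two. */
void bitonic_sort_bulk(float *b, size_t n, size_t p)
{
    assert(n != 0 && (n & (n - 1)) == 0);

    for (size_t k = 2; k <= n; k <<= 1) {          /* size of merged blocks */
        for (size_t s = k >> 1; s > 0; s >>= 1) {  /* compare distance */
            for (size_t i = 0; i < n; i++) {
                size_t l = i ^ s;                  /* partner index: data independent */
                if (l <= i)
                    continue;
                int ascending = ((i & k) == 0);
                /* Bulk step: apply the same compare-exchange to every input.
                 * On a GPU, each j would be handled by a separate thread. */
                for (size_t j = 0; j < p; j++) {
                    float x = b[i * p + j];
                    float y = b[l * p + j];
                    float lo = x < y ? x : y;
                    float hi = x < y ? y : x;
                    b[i * p + j] = ascending ? lo : hi;
                    b[l * p + j] = ascending ? hi : lo;
                }
            }
        }
    }
}
```

In this sketch the innermost loop over `j` stands in for the parallel threads of a CUDA kernel. With the assumed interleaved layout, consecutive values of `j` touch consecutive addresses, which is the access pattern GPUs can coalesce into few memory transactions. An algorithm such as quicksort would not fit this scheme, because its accessed addresses depend on the values being sorted.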
