NVIDIA CUDA
Compute Unified Device Architecture
Programming Guide
Version 1.1
11/29/2007
Table of Contents
Chapter 1 Introduction to CUDA ............ 1
1.1 The Graphics Processor Unit as a Data-Parallel Computing Device ............ 1
1.2 CUDA: A New Architecture for Computing on the GPU ............ 3
1.3 Document's Structure ............ 6
Chapter 2 Programming Model ............ 7
2.1 A Highly Multithreaded Coprocessor ............ 7
2.2 Thread Batching ............ 7
2.2.1 Thread Block ............ 7
2.2.2 Grid of Thread Blocks ............ 8
2.3 Memory Model ............ 10
Chapter 3 Hardware Implementation ............ 13
3.1 A Set of SIMD Multiprocessors with On-Chip Shared Memory ............ 13
3.2 Execution Model ............ 14
3.3 Compute Capability ............ 15
3.4 Multiple Devices ............ 16
3.5 Display Mode Switches ............ 16
Chapter 4 Application Programming Interface ............ 17
4.1 An Extension to the C Programming Language ............ 17
4.2 Language Extensions ............ 17
4.2.1 Function Type Qualifiers ............ 18
4.2.2 Variable Type Qualifiers ............ 19
4.2.3 Execution Configuration ............ 21
4.2.4 Built-in Variables ............ 21
4.2.5 Compilation with NVCC ............ 22
4.3 Common Runtime Component ............ 23
4.3.1 Built-in Vector Types ............ 23
4.3.2 Mathematical Functions ............ 24
4.3.3 Time Function ............ 24
4.3.4 Texture Type ............ 24
4.4 Device Runtime Component ............ 26
4.4.1 Mathematical Functions ............ 26
4.4.2 Synchronization Function ............ 26
4.4.3 Type Conversion Functions ............ 26
4.4.4 Type Casting Functions ............ 27
4.4.5 Texture Functions ............ 27
4.4.6 Atomic Functions ............ 28
4.5 Host Runtime Component ............ 28
4.5.1 Common Concepts ............ 29
4.5.2 Runtime API ............ 32
4.5.3 Driver API ............ 39
Chapter 5 Performance Guidelines ............ 47
5.1 Instruction Performance ............ 47
5.1.1 Instruction Throughput ............ 47
5.1.2 Memory Bandwidth ............ 49
5.2 Number of Threads per Block ............ 62
5.3 Data Transfer between Host and Device ............ 63
5.4 Texture Fetch versus Global or Constant Memory Read ............ 63
5.5 Overall Performance Optimization Strategies ............ 64
Chapter 6 Example of Matrix Multiplication ............ 67
6.1 Overview ............ 67
6.2 Source Code Listing ............ 69
6.3 Source Code Walkthrough ............ 71
6.3.1 Mul() ............ 71
6.3.2 Muld() ............ 71
Appendix A Technical Specifications ............ 73
A.1 General Specifications ............ 74
A.2 Floating-Point Standard ............ 74
Appendix B Mathematical Functions ............ 77
B.1 Common Runtime Component ............ 77
B.2 Device Runtime Component ............ 80
Appendix C Atomic Functions ............ 83
C.1 Arithmetic Functions ............ 83
C.1.1 atomicAdd() ............ 83
C.1.2 atomicSub() ............ 83
C.1.3 atomicExch() ............ 83
C.1.4 atomicMin() ............ 84
C.1.5 atomicMax() ............ 84
C.1.6 atomicInc() ............ 84
C.1.7 atomicDec() ............ 84
C.1.8 atomicCAS() ............ 84
C.2 Bitwise Functions ............ 85
C.2.1 atomicAnd() ............ 85
C.2.2 atomicOr() ............ 85
C.2.3 atomicXor() ............ 85
Appendix D Runtime API Reference ............ 87
D.1 Device Management ............ 87
D.1.1 cudaGetDeviceCount() ............ 87
D.1.2 cudaSetDevice() ............ 87
D.1.3 cudaGetDevice() ............ 87
D.1.4 cudaGetDeviceProperties() ............ 88
D.1.5 cudaChooseDevice() ............ 89
D.2 Thread Management ............ 89
D.2.1 cudaThreadSynchronize() ............ 89
D.2.2 cudaThreadExit() ............ 89
D.3 Stream Management ............ 89
D.3.1 cudaStreamCreate() ............ 89
D.3.2 cudaStreamQuery() ............ 89
D.3.3 cudaStreamSynchronize() ............ 89
D.3.4 cudaStreamDestroy() ............ 89
D.4 Event Management ............ 90
D.4.1 cudaEventCreate() ............ 90
D.4.2 cudaEventRecord() ............ 90
D.4.3 cudaEventQuery() ............ 90
D.4.4 cudaEventSynchronize() ............ 90
D.4.5 cudaEventDestroy() ............ 90
D.4.6 cudaEventElapsedTime() ............ 90
D.5 Memory Management ............ 91
D.5.1 cudaMalloc() ............ 91
D.5.2 cudaMallocPitch() ............ 91
D.5.3 cudaFree() ............ 91
D.5.4 cudaMallocArray() ............ 92
D.5.5 cudaFreeArray() ............ 92
D.5.6 cudaMallocHost() ............ 92
D.5.7 cudaFreeHost() ............ 92
D.5.8 cudaMemset() ............ 92
D.5.9 cudaMemset2D() ............ 92
D.5.10 cudaMemcpy() ............ 93
D.5.11 cudaMemcpy2D() ............ 93
D.5.12 cudaMemcpyToArray() ............ 94
D.5.13 cudaMemcpy2DToArray() ............ 94
D.5.14 cudaMemcpyFromArray() ............ 95
D.5.15 cudaMemcpy2DFromArray() ............ 95
D.5.16 cudaMemcpyArrayToArray() ............ 96
D.5.17 cudaMemcpy2DArrayToArray() ............ 96
D.5.18 cudaMemcpyToSymbol() ............ 96
D.5.19 cudaMemcpyFromSymbol() ............ 96
D.5.20 cudaGetSymbolAddress() ............ 97
D.5.21 cudaGetSymbolSize() ............ 97
D.6 Texture Reference Management ............ 97
D.6.1 Low-Level API ............ 97
D.6.2 High-Level API ............ 98
D.7 Execution Control ............ 100
D.7.1 cudaConfigureCall() ............ 100
D.7.2 cudaLaunch() ............ 100
D.7.3 cudaSetupArgument() ............ 100
D.8 OpenGL Interoperability ............ 100
D.8.1 cudaGLRegisterBufferObject() ............ 100
D.8.2 cudaGLMapBufferObject() ............ 101
D.8.3 cudaGLUnmapBufferObject() ............ 101
D.8.4 cudaGLUnregisterBufferObject() ............ 101
D.9 Direct3D Interoperability ............ 101
D.9.1 cudaD3D9Begin() ............ 101
D.9.2 cudaD3D9End() ............ 101
D.9.3 cudaD3D9RegisterVertexBuffer() ............ 101
D.9.4 cudaD3D9MapVertexBuffer() ............ 101
D.9.5 cudaD3D9UnmapVertexBuffer() ............ 102
D.9.6 cudaD3D9UnregisterVertexBuffer() ............ 102
D.9.7 cudaD3D9GetDevice() ............ 102
D.10 Error Handling ............ 102
D.10.1 cudaGetLastError() ............ 102
D.10.2 cudaGetErrorString() ............ 102
Appendix E Driver API Reference ............ 103
E.1 Initialization ............ 103
E.1.1 cuInit() ............ 103
E.2 Device Management ............ 103
E.2.1 cuDeviceGetCount() ............ 103
E.2.2 cuDeviceGet() ............ 103
E.2.3 cuDeviceGetName() ............ 103
E.2.4 cuDeviceTotalMem() ............ 104
E.2.5 cuDeviceComputeCapability() ............ 104
E.2.6 cuDeviceGetAttribute() ............ 104
E.2.7 cuDeviceGetProperties() ............ 105
E.3 Context Management ............ 106
E.3.1 cuCtxCreate() ............ 106
E.3.2 cuCtxAttach() ............ 106
E.3.3 cuCtxDetach() ............ 106
E.3.4 cuCtxGetDevice() ............ 106
E.3.5 cuCtxSynchronize() ............ 106
E.4 Module Management ............ 106
E.4.1 cuModuleLoad() ............ 106
E.4.2 cuModuleLoadData() ............ 107
E.4.3 cuModuleLoadFatBinary() ............ 107
E.4.4 cuModuleUnload() ............ 107
E.4.5 cuModuleGetFunction() ............ 107
E.4.6 cuModuleGetGlobal() ............ 107
E.4.7 cuModuleGetTexRef() ............ 108
E.5 Stream Management ............ 108
E.5.1 cuStreamCreate() ............ 108
E.5.2 cuStreamQuery() ............ 108
E.5.3 cuStreamSynchronize() ............ 108
E.5.4 cuStreamDestroy() ............ 108
E.6 Event Management ............ 108
E.6.1 cuEventCreate() ............ 108
E.6.2 cuEventRecord() ............ 108
E.6.3 cuEventQuery() ............ 109
E.6.4 cuEventSynchronize() ............ 109
E.6.5 cuEventDestroy() ............ 109
E.6.6 cuEventElapsedTime() ............ 109
E.7 Execution Control ............ 109
E.7.1 cuFuncSetBlockShape() ............ 109
E.7.2 cuFuncSetSharedSize() ............ 110
E.7.3 cuParamSetSize() ............ 110
E.7.4 cuParamSeti() ............ 110
E.7.5 cuParamSetf() ............ 110
E.7.6 cuParamSetv() ............ 110
E.7.7 cuParamSetTexRef() ............ 110
E.7.8 cuLaunch() ............ 110
E.7.9 cuLaunchGrid() ............ 111
E.8 Memory Management ............ 111
E.8.1 cuMemGetInfo() ............ 111
E.8.2 cuMemAlloc() ............ 111
E.8.3 cuMemAllocPitch() ............ 111
E.8.4 cuMemFree() ............ 112
E.8.5 cuMemAllocHost() ............ 112
E.8.6 cuMemFreeHost() ............ 112
E.8.7 cuMemGetAddressRange() ............ 112
E.8.8 cuArrayCreate() ............ 113
E.8.9 cuArrayGetDescriptor() ............ 114
E.8.10 cuArrayDestroy() ............ 114
E.8.11 cuMemset() ............ 114
E.8.12 cuMemset2D() ............ 114