<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.12.0" xml:lang="en-US">
<compounddef id="GPUTaskingcudaFlow" kind="page">
<compoundname>GPUTaskingcudaFlow</compoundname>
<title>GPU Tasking (cudaFlow)</title>
<tableofcontents>
<tocsect>
<name>Include the Header</name>
<reference>GPUTaskingcudaFlow_1GPUTaskingcudaFlowIncludeTheHeader</reference>
</tocsect>
<tocsect>
<name>What is a CUDA Graph?</name>
<reference>GPUTaskingcudaFlow_1WhatIsACudaGraph</reference>
</tocsect>
<tocsect>
<name>Create a cudaFlow</name>
<reference>GPUTaskingcudaFlow_1Create_a_cudaFlow</reference>
</tocsect>
<tocsect>
<name>Compile a cudaFlow Program</name>
<reference>GPUTaskingcudaFlow_1Compile_a_cudaFlow_program</reference>
</tocsect>
<tocsect>
<name>Run a cudaFlow on Specific GPU</name>
<reference>GPUTaskingcudaFlow_1run_a_cudaflow_on_a_specific_gpu</reference>
</tocsect>
<tocsect>
<name>Create Memory Operation Tasks</name>
<reference>GPUTaskingcudaFlow_1GPUMemoryOperations</reference>
</tocsect>
<tocsect>
<name>Offload a cudaFlow</name>
<reference>GPUTaskingcudaFlow_1OffloadAcudaFlow</reference>
</tocsect>
<tocsect>
<name>Update a cudaFlow</name>
<reference>GPUTaskingcudaFlow_1UpdateAcudaFlow</reference>
</tocsect>
<tocsect>
<name>Integrate a cudaFlow into Taskflow</name>
<reference>GPUTaskingcudaFlow_1IntegrateCudaFlowIntoTaskflow</reference>
</tocsect>
</tableofcontents>
<briefdescription>
</briefdescription>
<detaileddescription>
<para>Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement CPU-GPU heterogeneous tasking algorithms with <ulink url="https://developer.nvidia.com/cuda-zone">Nvidia CUDA</ulink>.</para>
<sect1 id="GPUTaskingcudaFlow_1GPUTaskingcudaFlowIncludeTheHeader">
<title>Include the Header</title><para>You need to include the header file, <computeroutput>taskflow/cuda/cudaflow.hpp</computeroutput>, for creating a GPU task graph using <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref>.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/><<ref refid="cudaflow_8hpp" kindref="compound">taskflow/cuda/cudaflow.hpp</ref>></highlight></codeline>
</programlisting></para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1WhatIsACudaGraph">
<title>What is a CUDA Graph?</title><para>CUDA Graph is an execution model that lets a series of CUDA kernels be defined and encapsulated as a single unit, i.e., a task graph of operations, rather than a sequence of individually launched operations. This organization allows multiple GPU operations to be launched through a single CPU operation and hence reduces launch overhead, especially for short-running kernels. The figure below illustrates the benefit of CUDA Graph:</para>
<para><image type="html" name="cuda_graph_benefit.png"></image>
</para>
<para>In this example, a sequence of short kernels is launched one by one by the CPU. The CPU launch overhead creates a significant gap between consecutive kernels. If we replace this sequence of kernels with a CUDA graph, we initially spend a little extra time building the graph and launching it as a whole for the first time, but subsequent executions are very fast because there is very little gap between the kernels. The difference is more pronounced when the same sequence of operations is repeated many times, for example, over many training epochs in a machine learning workload. In that case, the initial cost of building and launching the graph is amortized over all training iterations.</para>
<para><simplesect kind="attention"><para>For a comprehensive introduction to CUDA Graph, please refer to the <ulink url="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs">CUDA Graph Programming Guide</ulink>.</para>
</simplesect>
</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1Create_a_cudaFlow">
<title>Create a cudaFlow</title><para>Taskflow leverages <ulink url="https://developer.nvidia.com/blog/cuda-graphs/">CUDA Graph</ulink> to enable concurrent CPU-GPU tasking using a task graph model called <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref>. A cudaFlow manages a CUDA graph explicitly to execute dependent GPU operations in a single CPU call. The following example implements a cudaFlow that performs a saxpy (single-precision A·X Plus Y) workload:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/><<ref refid="cudaflow_8hpp" kindref="compound">taskflow/cuda/cudaflow.hpp</ref>></highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>saxpy<sp/>(single-precision<sp/>A·X<sp/>Plus<sp/>Y)<sp/>kernel</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">__global__<sp/></highlight><highlight class="keywordtype">void</highlight><highlight class="normal"><sp/>saxpy(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>n,<sp/></highlight><highlight class="keywordtype">float</highlight><highlight class="normal"><sp/>a,<sp/></highlight><highlight class="keywordtype">float</highlight><highlight class="normal"><sp/>*x,<sp/></highlight><highlight class="keywordtype">float</highlight><highlight class="normal"><sp/>*y)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>i<sp/>=<sp/>blockIdx.x*blockDim.x<sp/>+<sp/>threadIdx.x;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">if</highlight><highlight class="normal"><sp/>(i<sp/><<sp/>n)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>y[i]<sp/>=<sp/>a*x[i]<sp/>+<sp/>y[i];</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>main<sp/>function<sp/>begins</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>main()<sp/>{</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">unsigned</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>1<<20;<sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>size<sp/>of<sp/>the<sp/>vector</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="cpp/container/vector" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::vector<float></ref><sp/>hx(N,<sp/>1.0f);<sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>x<sp/>vector<sp/>at<sp/>host</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="cpp/container/vector" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::vector<float></ref><sp/>hy(N,<sp/>2.0f);<sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>y<sp/>vector<sp/>at<sp/>host</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordtype">float</highlight><highlight class="normal"><sp/>*dx{</highlight><highlight class="keyword">nullptr</highlight><highlight class="normal">};<sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>x<sp/>vector<sp/>at<sp/>device</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordtype">float</highlight><highlight class="normal"><sp/>*dy{</highlight><highlight class="keyword">nullptr</highlight><highlight class="normal">};<sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>y<sp/>vector<sp/>at<sp/>device</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaMalloc(&dx,<sp/>N*</highlight><highlight class="keyword">sizeof</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">float</highlight><highlight class="normal">));</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaMalloc(&dy,<sp/>N*</highlight><highlight class="keyword">sizeof</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">float</highlight><highlight class="normal">));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cudaflow;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>data<sp/>transfer<sp/>tasks</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>h2d_x<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">copy</ref>(dx,<sp/>hx.data(),<sp/>N).name(</highlight><highlight class="stringliteral">"h2d_x"</highlight><highlight class="normal">);<sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>h2d_y<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">copy</ref>(dy,<sp/>hy.data(),<sp/>N).name(</highlight><highlight class="stringliteral">"h2d_y"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>d2h_x<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">copy</ref>(hx.data(),<sp/>dx,<sp/>N).name(</highlight><highlight class="stringliteral">"d2h_x"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>d2h_y<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">copy</ref>(hy.data(),<sp/>dy,<sp/>N).name(</highlight><highlight class="stringliteral">"d2h_y"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>launch<sp/>saxpy<<<(N+255)/256,<sp/>256,<sp/>0>>>(N,<sp/>2.0f,<sp/>dx,<sp/>dy)</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>kernel<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a1473a15a6023fbc25e1f029f2ff84aec" kindref="member">kernel</ref>(</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>(N+255)/256,<sp/>256,<sp/>0,<sp/>saxpy,<sp/>N,<sp/>2.0f,<sp/>dx,<sp/>dy</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>).name(</highlight><highlight class="stringliteral">"saxpy"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>kernel.<ref refid="classtf_1_1cudaTask_1a4a9ca1a34bac47e4c9b04eb4fb2f7775" kindref="member">succeed</ref>(h2d_x,<sp/>h2d_y)</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>.<ref refid="classtf_1_1cudaTask_1abdd68287ec4dff4216af34d1db44d1b4" kindref="member">precede</ref>(d2h_x,<sp/>d2h_y);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>run<sp/>the<sp/>cudaflow<sp/>through<sp/>a<sp/>stream</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaflow.run(stream);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>dump<sp/>the<sp/>cudaflow</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1abd73a9268b80e74803f241ee10a842b6" kindref="member">dump</ref>(<ref refid="cpp/io/basic_ostream" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::cout</ref>);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para>The cudaFlow graph consists of two CPU-to-GPU data copies (<computeroutput>h2d_x</computeroutput> and <computeroutput>h2d_y</computeroutput>), one kernel (<computeroutput>saxpy</computeroutput>), and two GPU-to-CPU data copies (<computeroutput>d2h_x</computeroutput> and <computeroutput>d2h_y</computeroutput>), executed in the order given by their task dependencies.</para>
<para><dotfile name="saxpy.dot"></dotfile>
</para>
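<para>To check the result of the saxpy example on the host, a plain C++ reference can be handy. The sketch below is a minimal illustration and not part of Taskflow: <computeroutput>saxpy_reference</computeroutput> and <computeroutput>num_blocks</computeroutput> are hypothetical helper names. It applies the same update rule as the kernel and reproduces the <computeroutput>(N+255)/256</computeroutput> grid-size computation as a ceiling division.</para>
<para><programlisting filename=".cpp"><![CDATA[
```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the saxpy kernel: y[i] = a*x[i] + y[i].
// saxpy_reference is a hypothetical helper, not a Taskflow API.
inline void saxpy_reference(float a,
                            const std::vector<float>& x,
                            std::vector<float>& y) {
  for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
    y[i] = a * x[i] + y[i];
  }
}

// Number of thread blocks needed to cover n elements, i.e., the ceiling
// of n / block_size. With block_size = 256 this is (n + 255) / 256.
// num_blocks is a hypothetical helper, not a Taskflow API.
constexpr unsigned num_blocks(unsigned n, unsigned block_size) {
  return (n + block_size - 1) / block_size;
}
```
]]></programlisting></para>
<para>With the inputs used in the example (every element of <computeroutput>hx</computeroutput> set to <computeroutput>1.0f</computeroutput>, every element of <computeroutput>hy</computeroutput> set to <computeroutput>2.0f</computeroutput>, and <computeroutput>a = 2.0f</computeroutput>), each element of <computeroutput>hy</computeroutput> should equal <computeroutput>4.0f</computeroutput> once the cudaFlow has finished.</para>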
<para>Taskflow does not attempt to simplify kernel programming; instead, it focuses on tasking CUDA operations and their dependencies. In other words, <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref> is a lightweight C++ abstraction over CUDA Graph. This organization lets users take full advantage of the CUDA features that match their domain knowledge, while leaving the difficult details of task parallelism to Taskflow.</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1Compile_a_cudaFlow_program">
<title>Compile a cudaFlow Program</title><para>Use <ulink url="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvcc</ulink> to compile a cudaFlow program:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>my_cudaflow.cu<sp/>-I<sp/>path/to/include/taskflow<sp/>-O2<sp/>-o<sp/>my_cudaflow</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>./my_cudaflow</highlight></codeline>
</programlisting></para>
<para>Please visit the page <ref refid="CompileTaskflowWithCUDA" kindref="compound">Compile Taskflow with CUDA</ref> for more details.</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1run_a_cudaflow_on_a_specific_gpu">
<title>Run a cudaFlow on Specific GPU</title><para>By default, a cudaFlow runs on the current GPU context associated with the caller, which is typically GPU <computeroutput>0</computeroutput>. Each CUDA GPU has an integer identifier in the range <computeroutput>[0, N)</computeroutput> that represents the context of that GPU, where <computeroutput>N</computeroutput> is the number of GPUs in the system. You can run a cudaFlow on a specific GPU by switching the context to that GPU using <ref refid="classtf_1_1cudaScopedDevice" kindref="compound">tf::cudaScopedDevice</ref>. The code below creates a cudaFlow capturer and runs it on GPU <computeroutput>2</computeroutput>.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="normal">{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>RAII-styled<sp/>switcher<sp/>to<sp/>the<sp/>context<sp/>of<sp/>GPU<sp/>2</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaScopedDevice" kindref="compound">tf::cudaScopedDevice</ref><sp/>context(2);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>cudaFlow<sp/>capturer<sp/>under<sp/>GPU<sp/>2</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaFlowCapturer" kindref="compound">tf::cudaFlowCapturer</ref><sp/>capturer;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>...</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>stream<sp/>under<sp/>GPU<sp/>2<sp/>and<sp/>offload<sp/>the<sp/>capturer<sp/>to<sp/>that<sp/>GPU</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>capturer.<ref refid="classtf_1_1cudaFlowCapturer_1a952596fd7c46acee4c2459d8fe39da28" kindref="member">run</ref>(stream);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para><ref refid="classtf_1_1cudaScopedDevice" kindref="compound">tf::cudaScopedDevice</ref> is an RAII-styled wrapper that performs a <emphasis>scoped</emphasis> switch to the given GPU context. When the wrapper goes out of scope, it switches back to the original context.</para>
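<para>The scoped-switch idea can be sketched in plain C++ without a GPU. The code below is only an illustration of the RAII pattern and is not the Taskflow implementation: <computeroutput>ScopedDeviceSketch</computeroutput>, <computeroutput>current_device</computeroutput>, and <computeroutput>device_inside_scope</computeroutput> are hypothetical names, and a thread-local integer stands in for the per-thread current device that a real wrapper would query and set through the CUDA runtime.</para>
<para><programlisting filename=".cpp"><![CDATA[
```cpp
// Stand-in for the CUDA runtime's per-thread "current device".
// A real wrapper would use the CUDA runtime (for example cudaGetDevice
// and cudaSetDevice) instead of this hypothetical variable.
thread_local int current_device = 0;

// Hypothetical RAII guard: switches to `device` on construction and
// restores the previously current device on destruction.
class ScopedDeviceSketch {
 public:
  explicit ScopedDeviceSketch(int device) : _previous(current_device) {
    current_device = device;
  }
  ~ScopedDeviceSketch() { current_device = _previous; }

  // The guard owns a pending "switch back", so copying is disabled.
  ScopedDeviceSketch(const ScopedDeviceSketch&) = delete;
  ScopedDeviceSketch& operator=(const ScopedDeviceSketch&) = delete;

 private:
  int _previous;
};

// Returns the device that is current while a guard for `device` is alive.
inline int device_inside_scope(int device) {
  ScopedDeviceSketch guard(device);
  return current_device;
}
```
]]></programlisting></para>
<para>In this sketch, calling <computeroutput>device_inside_scope(2)</computeroutput> observes device <computeroutput>2</computeroutput> inside the guarded scope, and <computeroutput>current_device</computeroutput> is back to its previous value after the call returns.</para>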
<para><simplesect kind="attention"><para><ref refid="classtf_1_1cudaScopedDevice" kindref="compound">tf::cudaScopedDevice</ref> allows you to place a cudaFlow on a particular GPU device, but it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU <computeroutput>2</computeroutput> while accessing it from a kernel on GPU <computeroutput>0</computeroutput>. An easy practice for multi-GPU programming is to allocate <emphasis>unified shared memory</emphasis> using <computeroutput>cudaMallocManaged</computeroutput> and let the CUDA runtime perform automatic memory migration between GPUs.</para>
</simplesect>
</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1GPUMemoryOperations">
<title>Create Memory Operation Tasks</title><para>cudaFlow provides a set of methods for users to manipulate device memory. These methods fall into two categories: <emphasis>raw</emphasis> data and <emphasis>typed</emphasis> data. Raw data operations are methods with the prefix <computeroutput>mem</computeroutput>, such as <computeroutput>memcpy</computeroutput> and <computeroutput>memset</computeroutput>, that operate in <emphasis>bytes</emphasis>. Typed data operations, such as <computeroutput>copy</computeroutput>, <computeroutput>fill</computeroutput>, and <computeroutput>zero</computeroutput>, take a <emphasis>logical count</emphasis> of elements. For instance, the following three methods all zero <computeroutput>sizeof(int)*count</computeroutput> bytes of the device memory area pointed to by <computeroutput>target</computeroutput>.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>target;</highlight></codeline>
<codeline><highlight class="normal">cudaMalloc(&target,<sp/>count*</highlight><highlight class="keyword">sizeof</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cudaflow;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>memset_target<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a10196f49de261a4042de328aab2452c8" kindref="member">memset</ref>(target,<sp/>0,<sp/></highlight><highlight class="keyword">sizeof</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">)<sp/>*<sp/>count);</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>same_as_above<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a32634c5645c14b99ceeaafe77ea5ea62" kindref="member">fill</ref>(target,<sp/>0,<sp/>count);</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>same_as_above_again<sp/>=<sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1ab45bc592a33380adf74d6f1e7690bd4c" kindref="member">zero</ref>(target,<sp/>count);</highlight></codeline>
</programlisting></para>
<para>The method <ref refid="classtf_1_1cudaGraphBase_1a32634c5645c14b99ceeaafe77ea5ea62" kindref="member">tf::cudaFlow::fill</ref> is a more powerful variant of <ref refid="classtf_1_1cudaGraphBase_1a10196f49de261a4042de328aab2452c8" kindref="member">tf::cudaFlow::memset</ref>. It can fill a memory area with any value of type <computeroutput>T</computeroutput>, given that <computeroutput>sizeof(T)</computeroutput> is 1, 2, or 4 bytes. The following example creates a GPU task to fill <computeroutput>count</computeroutput> elements in the array <computeroutput>target</computeroutput> with value <computeroutput>1234</computeroutput>.</para>
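<para><programlisting filename=".cpp"><codeline><highlight class="normal">cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a32634c5645c14b99ceeaafe77ea5ea62" kindref="member">fill</ref>(target,<sp/>1234,<sp/>count);</highlight></codeline>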
</programlisting></para>
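<para>The size constraint on the fill value can be expressed as a compile-time check. The following is a minimal sketch under the assumption that only the stated rule matters (the value type must be 1, 2, or 4 bytes wide); <computeroutput>is_fillable_v</computeroutput> is a hypothetical name and is not provided by Taskflow.</para>
<para><programlisting filename=".cpp"><![CDATA[
```cpp
#include <cstdint>

// Hypothetical trait mirroring the documented constraint on fill:
// the value type must occupy exactly 1, 2, or 4 bytes.
template <typename T>
constexpr bool is_fillable_v =
  sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4;

// Fixed-width integer types make the rule easy to see.
static_assert(is_fillable_v<std::uint8_t>,   "1-byte values qualify");
static_assert(is_fillable_v<std::uint16_t>,  "2-byte values qualify");
static_assert(is_fillable_v<std::uint32_t>,  "4-byte values qualify");
static_assert(!is_fillable_v<std::uint64_t>, "8-byte values do not qualify");
```
]]></programlisting></para>
<para>A check of this kind rejects unsupported value types at compile time rather than when the graph runs.</para>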
<para>The same concept applies to <ref refid="classtf_1_1cudaGraphBase_1a5e704c7bb669a82f4fe140ecb4576eb0" kindref="member">tf::cudaFlow::memcpy</ref> and <ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">tf::cudaFlow::copy</ref>. The following two method calls are equivalent.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="normal">cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a5e704c7bb669a82f4fe140ecb4576eb0" kindref="member">memcpy</ref>(target,<sp/>source,<sp/></highlight><highlight class="keyword">sizeof</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">)<sp/>*<sp/>count);</highlight></codeline>
<codeline><highlight class="normal">cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a02a041d5dd9e1e8958eb43e09331051e" kindref="member">copy</ref>(target,<sp/>source,<sp/>count);</highlight></codeline>
</programlisting></para>
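<para>The difference between a byte count and an element count can be illustrated on the host with the C++ standard library. This is an analogy only and does not use the Taskflow API: <computeroutput>raw_copy</computeroutput> and <computeroutput>typed_copy</computeroutput> are hypothetical helpers built on <computeroutput>std::memcpy</computeroutput> and <computeroutput>std::copy_n</computeroutput>.</para>
<para><programlisting filename=".cpp"><![CDATA[
```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Raw copy: the size passed to std::memcpy is a number of BYTES,
// so the element count must be scaled by sizeof(int).
inline void raw_copy(int* target, const int* source, std::size_t count) {
  std::memcpy(target, source, sizeof(int) * count);
}

// Typed copy: std::copy_n takes a number of ELEMENTS and derives
// the byte size from the element type.
inline void typed_copy(int* target, const int* source, std::size_t count) {
  std::copy_n(source, count, target);
}
```
]]></programlisting></para>
<para>Both helpers copy the same <computeroutput>count</computeroutput> integers. Forgetting the <computeroutput>sizeof(int)</computeroutput> factor in the raw version would copy only a fraction of the data, which is the kind of mistake the typed methods help avoid.</para>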
</sect1>
<sect1 id="GPUTaskingcudaFlow_1OffloadAcudaFlow">
<title>Offload a cudaFlow</title><para>To offload a cudaFlow to a GPU, you need to use tf::cudaFlow::run and pass a <ref refid="namespacetf_1af19c9b301dc0b0fe2a51a960fa427e83" kindref="member">tf::cudaStream</ref> created on that GPU. The run method is asynchronous and can be explicitly synchronized through the given stream.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>launch<sp/>a<sp/>cudaflow<sp/>asynchronously<sp/>through<sp/>a<sp/>stream</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaflow.run(stream);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>wait<sp/>for<sp/>the<sp/>cudaflow<sp/>to<sp/>finish</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
</programlisting></para>
<para>When you offload a cudaFlow using tf::cudaFlow::run, the runtime transforms that cudaFlow (i.e., the application GPU task graph) into a native executable instance and submits it to the CUDA runtime for execution. There is always a one-to-one mapping between a cudaFlow and its native CUDA graph representation (except for those constructed using <ref refid="classtf_1_1cudaFlowCapturer" kindref="compound">tf::cudaFlowCapturer</ref>).</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1UpdateAcudaFlow">
<title>Update a cudaFlow</title><para>Many GPU applications require you to launch a cudaFlow multiple times and update node parameters (e.g., kernel parameters and memory addresses) between iterations. cudaFlow allows you to update the parameters of created tasks and run the updated cudaFlow with new parameters. Every task-creation method in <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref> has an overload that updates the parameters of a task previously created by that method.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cf;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>kernel<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaTask" kindref="compound">tf::cudaTask</ref><sp/>task<sp/>=<sp/>cf.<ref refid="classtf_1_1cudaGraphBase_1a1473a15a6023fbc25e1f029f2ff84aec" kindref="member">kernel</ref>(grid1,<sp/>block1,<sp/>shm1,<sp/>kernel,<sp/>kernel_args_1);</highlight></codeline>
<codeline><highlight class="normal">cf.run(stream);</highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>update<sp/>the<sp/>created<sp/>kernel<sp/>task<sp/>with<sp/>different<sp/>parameters</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cf.<ref refid="classtf_1_1cudaGraphBase_1a1473a15a6023fbc25e1f029f2ff84aec" kindref="member">kernel</ref>(task,<sp/>grid2,<sp/>block2,<sp/>shm2,<sp/>kernel,<sp/>kernel_args_2);</highlight></codeline>
<codeline><highlight class="normal">cf.run(stream);</highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
</programlisting></para>
<para>Between successive offloads (i.e., iterative executions of a cudaFlow), you can <emphasis>ONLY</emphasis> update task parameters, such as kernel execution parameters and memory operation parameters. You must <emphasis>NOT</emphasis> change the topology of the cudaFlow, for example by adding a new task or a new dependency. This is a limitation of CUDA Graph.</para>
<para><simplesect kind="attention"><para>There are a few restrictions on updating task parameters in a cudaFlow. Notably, you must <emphasis>NOT</emphasis> change the topology of an offloaded graph. In addition, update methods have the following limitations:<itemizedlist>
<listitem><para>kernel tasks:<itemizedlist>
<listitem><para>The kernel function is not allowed to change. This restriction applies to all algorithm tasks that are created using a lambda.</para>
</listitem></itemizedlist>
</para>
</listitem><listitem><para>memset and memcpy tasks:<itemizedlist>
<listitem><para>The CUDA device(s) to which the operand(s) were allocated/mapped cannot change.</para>
</listitem><listitem><para>The source/destination memory must be allocated from the same context as the original source/destination memory.</para>
</listitem></itemizedlist>
</para>
</listitem></itemizedlist>
</para>
</simplesect>
</para>
</sect1>
<sect1 id="GPUTaskingcudaFlow_1IntegrateCudaFlowIntoTaskflow">
<title>Integrate a cudaFlow into Taskflow</title><para>You can create a task to enclose a cudaFlow and run it from a worker thread. The usage of the cudaFlow remains the same except that the cudaFlow is run by a worker thread from a taskflow task. The following example runs a cudaFlow from a static task:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="normal"><ref refid="classtf_1_1Executor" kindref="compound">tf::Executor</ref><sp/>executor;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1Taskflow" kindref="compound">tf::Taskflow</ref><sp/>taskflow;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">taskflow.<ref refid="classtf_1_1FlowBuilder_1a60d7a666cab71ecfa3010b2efb0d6b57" kindref="member">emplace</ref>([](){</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>cudaFlow<sp/>inside<sp/>a<sp/>static<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cudaflow;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>...<sp/>create<sp/>a<sp/>kernel<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaflow.<ref refid="classtf_1_1cudaGraphBase_1a1473a15a6023fbc25e1f029f2ff84aec" kindref="member">kernel</ref>(...);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="comment">//<sp/>run<sp/>the<sp/>cudaflow<sp/>through<sp/>a<sp/>stream</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>cudaflow.run(stream);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal">});</highlight></codeline>
</programlisting> </para>
</sect1>
</detaileddescription>
<location file="doxygen/cookbook/gpu_tasking_cudaflow.dox"/>
</compounddef>
</doxygen>