forked from taskflow/taskflow
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.12.0" xml:lang="en-US">
<compounddef id="CompileTaskflowWithCUDA" kind="page">
<compoundname>CompileTaskflowWithCUDA</compoundname>
<title>Compile Taskflow with CUDA</title>
<tableofcontents>
<tocsect>
<name>Install CUDA Compiler</name>
<reference>CompileTaskflowWithCUDA_1InstallCUDACompiler</reference>
</tocsect>
<tocsect>
<name>Compile Source Code Directly</name>
<reference>CompileTaskflowWithCUDA_1CompileTaskflowWithCUDADirectly</reference>
</tocsect>
<tocsect>
<name>Compile Source Code Separately</name>
<reference>CompileTaskflowWithCUDA_1CompileTaskflowWithCUDASeparately</reference>
<tableofcontents>
<tocsect>
<name>Link Objects Using nvcc</name>
<reference>CompileTaskflowWithCUDA_1CompileTaskflowWithCUDANaiveLinking</reference>
</tocsect>
<tocsect>
<name>Link Objects Using Different Linkers</name>
<reference>CompileTaskflowWithCUDA_1CompileTaskflowWithCUDADifferentLinkers</reference>
</tocsect>
</tableofcontents>
</tocsect>
</tableofcontents>
<briefdescription>
</briefdescription>
<detaileddescription>
<sect1 id="CompileTaskflowWithCUDA_1InstallCUDACompiler">
<title>Install CUDA Compiler</title><para>To compile Taskflow with CUDA code, you need the <computeroutput>nvcc</computeroutput> compiler, which ships with the CUDA Toolkit. You can download the toolkit from the official <ulink url="https://developer.nvidia.com/cuda-downloads">CUDA Toolkit Downloads</ulink> page.</para>
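<para>Once the toolkit is installed, a quick way to confirm that <computeroutput>nvcc</computeroutput> is reachable from your shell is to ask it for its version. This is only a sanity check, and it assumes the toolkit's <computeroutput>bin</computeroutput> directory is already on your <computeroutput>PATH</computeroutput>. The output depends on the installed release and is not shown here:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>--version</highlight></codeline>
</programlisting></para>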
</sect1>
<sect1 id="CompileTaskflowWithCUDA_1CompileTaskflowWithCUDADirectly">
<title>Compile Source Code Directly</title><para>Taskflow's GPU programming interface for CUDA is <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref>. Consider the following <computeroutput>simple.cu</computeroutput> program that launches a single kernel function to output a message:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/>&lt;<ref refid="taskflow_8hpp" kindref="compound">taskflow/taskflow.hpp</ref>&gt;</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="preprocessor">#include<sp/>&lt;taskflow/cudaflow.hpp&gt;</highlight><highlight class="normal"><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="preprocessor">#include<sp/>&lt;taskflow/cuda/for_each.hpp&gt;</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>main(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>argc,<sp/></highlight><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">char</highlight><highlight class="normal">**<sp/>argv)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Executor" kindref="compound">tf::Executor</ref><sp/>executor;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Taskflow" kindref="compound">tf::Taskflow</ref><sp/>taskflow;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>task1<sp/>=<sp/>taskflow.<ref refid="classtf_1_1FlowBuilder_1a60d7a666cab71ecfa3010b2efb0d6b57" kindref="member">emplace</ref>([](){}).name(</highlight><highlight class="stringliteral">"cpu<sp/>task"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>task2<sp/>=<sp/>taskflow.<ref refid="classtf_1_1FlowBuilder_1a60d7a666cab71ecfa3010b2efb0d6b57" kindref="member">emplace</ref>([](){</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>cudaFlow<sp/>of<sp/>a<sp/>single-threaded<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cf;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>cf.<ref refid="classtf_1_1cudaFlow_1ac2906cb0002fc411a983d100a3d58d62" kindref="member">single_task</ref>([]<sp/>__device__<sp/>()<sp/>{<sp/>printf(</highlight><highlight class="stringliteral">"hello<sp/>cudaFlow!\n"</highlight><highlight class="normal">);<sp/>});</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>launch<sp/>the<sp/>cudaflow<sp/>through<sp/>a<sp/>stream</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>cf.run(stream);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}).name(</highlight><highlight class="stringliteral">"gpu<sp/>task"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>task1.<ref refid="classtf_1_1Task_1a8c78c453295a553c1c016e4062da8588" kindref="member">precede</ref>(task2);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>executor.<ref refid="classtf_1_1Executor_1a519777f5783981d534e9e53b99712069" kindref="member">run</ref>(taskflow).wait();</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>0;</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para>The easiest way to compile Taskflow with CUDA code (e.g., cudaFlow, kernels) is to use <ulink url="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvcc</ulink>:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>-I<sp/>path/to/taskflow/<sp/>--extended-lambda<sp/>simple.cu<sp/>-o<sp/>simple</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>./simple</highlight></codeline>
<codeline><highlight class="normal">hello<sp/>cudaFlow!</highlight></codeline>
</programlisting></para>
</sect1>
<sect1 id="CompileTaskflowWithCUDA_1CompileTaskflowWithCUDASeparately">
<title>Compile Source Code Separately</title><para>Large GPU applications often compile a program into separate objects and link them together to form an executable or a library. With Taskflow, you can compile your CPU code and GPU code separately using <computeroutput>nvcc</computeroutput> and other compilers (such as <computeroutput>g++</computeroutput> and <computeroutput>clang++</computeroutput>). Consider the following example that defines two tasks in two separate source files (<computeroutput>main.cpp</computeroutput> and <computeroutput>cudaflow.cpp</computeroutput>):</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>main.cpp</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="preprocessor">#include<sp/>&lt;<ref refid="taskflow_8hpp" kindref="compound">taskflow/taskflow.hpp</ref>&gt;</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>make_cudaflow(<ref refid="classtf_1_1Taskflow" kindref="compound">tf::Taskflow</ref>&amp;<sp/>taskflow);<sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>cudaFlow<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>main()<sp/>{</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Executor" kindref="compound">tf::Executor</ref><sp/>executor;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Taskflow" kindref="compound">tf::Taskflow</ref><sp/>taskflow;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>task1<sp/>=<sp/>taskflow.<ref refid="classtf_1_1FlowBuilder_1a60d7a666cab71ecfa3010b2efb0d6b57" kindref="member">emplace</ref>([](){<sp/><ref refid="cpp/io/basic_ostream" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::cout</ref><sp/>&lt;&lt;<sp/></highlight><highlight class="stringliteral">"main.cpp!\n"</highlight><highlight class="normal">;<sp/>})</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>.name(</highlight><highlight class="stringliteral">"cpu<sp/>task"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>task2<sp/>=<sp/>make_cudaflow(taskflow);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>task1.<ref refid="classtf_1_1Task_1a8c78c453295a553c1c016e4062da8588" kindref="member">precede</ref>(task2);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>executor.<ref refid="classtf_1_1Executor_1a519777f5783981d534e9e53b99712069" kindref="member">run</ref>(taskflow).wait();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>0;</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>cudaflow.cpp</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="preprocessor">#include<sp/>&lt;<ref refid="taskflow_8hpp" kindref="compound">taskflow/taskflow.hpp</ref>&gt;</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="preprocessor">#include<sp/>&lt;taskflow/cudaflow.hpp&gt;</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1Task" kindref="compound">tf::Task</ref><sp/>make_cudaflow(<ref refid="classtf_1_1Taskflow" kindref="compound">tf::Taskflow</ref>&amp;<sp/>taskflow)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>taskflow.<ref refid="classtf_1_1FlowBuilder_1a60d7a666cab71ecfa3010b2efb0d6b57" kindref="member">emplace</ref>([](){</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>create<sp/>a<sp/>cudaFlow<sp/>of<sp/>a<sp/>single-threaded<sp/>task</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref><sp/>cf;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>cf.<ref refid="classtf_1_1cudaFlow_1ac2906cb0002fc411a983d100a3d58d62" kindref="member">single_task</ref>([]<sp/>__device__<sp/>()<sp/>{<sp/>printf(</highlight><highlight class="stringliteral">"cudaflow.cpp!\n"</highlight><highlight class="normal">);<sp/>});</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>launch<sp/>the<sp/>cudaflow<sp/>through<sp/>a<sp/>stream</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>cf.run(stream);</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}).name(</highlight><highlight class="stringliteral">"gpu<sp/>task"</highlight><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para>Compile each source file into an object file (using <computeroutput>g++</computeroutput> as the CPU compiler, for example):</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>g++<sp/>-std=c++17<sp/>-I<sp/>path/to/taskflow<sp/>-c<sp/>main.cpp<sp/>-o<sp/>main.o</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>--extended-lambda<sp/>-x<sp/>cu<sp/>-I<sp/>path/to/taskflow<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-dc<sp/>cudaflow.cpp<sp/>-o<sp/>cudaflow.o</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>ls</highlight></codeline>
<codeline><highlight class="normal">#<sp/>now<sp/>we<sp/>have<sp/>the<sp/>two<sp/>compiled<sp/>.o<sp/>objects,<sp/>main.o<sp/>and<sp/>cudaflow.o</highlight></codeline>
<codeline><highlight class="normal">main.o<sp/>cudaflow.o<sp/></highlight></codeline>
</programlisting></para>
<para>The <computeroutput>--extended-lambda</computeroutput> option tells <computeroutput>nvcc</computeroutput> to generate GPU code for lambdas annotated with <computeroutput>__device__</computeroutput>. The <computeroutput>-x cu</computeroutput> option tells <computeroutput>nvcc</computeroutput> to treat the input files as <computeroutput>.cu</computeroutput> files containing both CPU and GPU code. By default, <computeroutput>nvcc</computeroutput> treats <computeroutput>.cpp</computeroutput> files as CPU-only code. This option is required here for <computeroutput>nvcc</computeroutput> to generate device code, and it is also a handy way to avoid renaming source files in larger projects. The <computeroutput>-dc</computeroutput> option tells <computeroutput>nvcc</computeroutput> to compile each input into an object file containing relocatable device code that can be linked later.</para>
<para>You may also need to specify the target architecture with the <computeroutput>-arch</computeroutput> option so that <computeroutput>nvcc</computeroutput> generates code for an SM architecture compatible with your GPU. For instance, the following command generates device code that targets GPUs with compute capability 7.5 or later:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>--extended-lambda<sp/>-x<sp/>cu<sp/>-arch=sm_75<sp/>-I<sp/>path/to/taskflow<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-dc<sp/>cudaflow.cpp<sp/>-o<sp/>cudaflow.o</highlight></codeline>
</programlisting></para>
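<para>If the same object has to run on more than one GPU generation, <computeroutput>nvcc</computeroutput> also accepts repeated <computeroutput>-gencode</computeroutput> options in place of a single <computeroutput>-arch</computeroutput>. The following is a sketch rather than a recommendation: compute capability 8.6 is chosen purely for illustration as the second target, it assumes a toolkit recent enough to know that architecture, and it should be replaced with whatever your GPUs report:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>--extended-lambda<sp/>-x<sp/>cu<sp/>-I<sp/>path/to/taskflow<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-gencode<sp/>arch=compute_75,code=sm_75<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-gencode<sp/>arch=compute_86,code=sm_86<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-dc<sp/>cudaflow.cpp<sp/>-o<sp/>cudaflow.o</highlight></codeline>
</programlisting></para>
<para>NVIDIA's separate-compilation examples pass the same architecture options again when the device code is linked, so it is safest to repeat them at the link step as well.</para>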
<sect2 id="CompileTaskflowWithCUDA_1CompileTaskflowWithCUDANaiveLinking">
<title>Link Objects Using nvcc</title><para>Linking the compiled objects with <computeroutput>nvcc</computeroutput> requires nothing special: use <computeroutput>nvcc</computeroutput> in place of your usual compiler for the link step, and it takes care of all the necessary steps, including device code linking:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>main.o<sp/>cudaflow.o<sp/>-o<sp/>main</highlight></codeline>
<codeline></codeline>
<codeline><highlight class="normal">#<sp/>run<sp/>the<sp/>main<sp/>program<sp/></highlight></codeline>
<codeline><highlight class="normal">~$<sp/>./main</highlight></codeline>
<codeline><highlight class="normal">main.cpp!</highlight></codeline>
<codeline><highlight class="normal">cudaflow.cpp!</highlight></codeline>
</programlisting></para>
</sect2>
<sect2 id="CompileTaskflowWithCUDA_1CompileTaskflowWithCUDADifferentLinkers">
<title>Link Objects Using Different Linkers</title><para>You can choose to use a compiler other than <computeroutput>nvcc</computeroutput> for the final link step. Since your CPU compiler does not know how to link CUDA device code, you have to add a step to your build in which <computeroutput>nvcc</computeroutput> links the CUDA device code, using the option <computeroutput>-dlink</computeroutput>:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">~$<sp/>nvcc<sp/>-o<sp/>gpuCode.o<sp/>-dlink<sp/>main.o<sp/>cudaflow.o</highlight></codeline>
</programlisting></para>
<para>This step links all the <emphasis>device object code</emphasis> and places it into <computeroutput>gpuCode.o</computeroutput>.</para>
<para><simplesect kind="attention"><para>Note that this step links device code only. The CPU object code in <computeroutput>main.o</computeroutput> and <computeroutput>cudaflow.o</computeroutput> is not included in <computeroutput>gpuCode.o</computeroutput>.</para>
</simplesect>
To complete the link to an executable, you can use, for example, <computeroutput>ld</computeroutput> or <computeroutput>g++</computeroutput>.</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">#<sp/>replace<sp/>/usr/local/cuda/lib64<sp/>with<sp/>your<sp/>own<sp/>CUDA<sp/>library<sp/>installation<sp/>path</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>g++<sp/>-pthread<sp/>gpuCode.o<sp/>main.o<sp/>cudaflow.o<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/>-L<sp/>/usr/local/cuda/lib64/<sp/>-lcudart<sp/>-o<sp/>main</highlight></codeline>
<codeline></codeline>
<codeline><highlight class="normal">#<sp/>run<sp/>the<sp/>main<sp/>program</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>./main</highlight></codeline>
<codeline><highlight class="normal">main.cpp!</highlight></codeline>
<codeline><highlight class="normal">cudaflow.cpp!</highlight></codeline>
</programlisting></para>
<para>We give <computeroutput>g++</computeroutput> all of the objects again because it needs the CPU object code, which is not in <computeroutput>gpuCode.o</computeroutput>. The device code stored in the original objects, <computeroutput>main.o</computeroutput> and <computeroutput>cudaflow.o</computeroutput>, does not conflict with the code in <computeroutput>gpuCode.o</computeroutput>. <computeroutput>g++</computeroutput> ignores device code because it does not know how to link it, and the device code in <computeroutput>gpuCode.o</computeroutput> is already linked and ready to go.</para>
<para><simplesect kind="attention"><para>This intentional ignorance is extremely useful in large builds where intermediate objects may have both CPU and GPU code. In this case, we let the GPU and CPU linkers each do their own job, noting that the CPU linker is always the last one we run. The CUDA Runtime API library is automatically linked when we use <computeroutput>nvcc</computeroutput> for linking, but we must explicitly link it (<computeroutput>-lcudart</computeroutput>) when using another linker. </para>
</simplesect>
</para>
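<para>Putting the steps of this section together, the whole build can be sketched as one command sequence. The include path and the CUDA library directory are the same placeholders used above and have to be adjusted to your installation:</para>
<para><programlisting filename=".shell-session"><codeline><highlight class="normal">#<sp/>1.<sp/>compile<sp/>the<sp/>CPU-only<sp/>source<sp/>with<sp/>the<sp/>host<sp/>compiler</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>g++<sp/>-std=c++17<sp/>-I<sp/>path/to/taskflow<sp/>-c<sp/>main.cpp<sp/>-o<sp/>main.o</highlight></codeline>
<codeline></codeline>
<codeline><highlight class="normal">#<sp/>2.<sp/>compile<sp/>the<sp/>GPU<sp/>source<sp/>into<sp/>relocatable<sp/>device<sp/>code</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>nvcc<sp/>-std=c++17<sp/>--extended-lambda<sp/>-x<sp/>cu<sp/>-I<sp/>path/to/taskflow<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/><sp/><sp/><sp/><sp/>-dc<sp/>cudaflow.cpp<sp/>-o<sp/>cudaflow.o</highlight></codeline>
<codeline></codeline>
<codeline><highlight class="normal">#<sp/>3.<sp/>link<sp/>the<sp/>device<sp/>code<sp/>only</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>nvcc<sp/>-o<sp/>gpuCode.o<sp/>-dlink<sp/>main.o<sp/>cudaflow.o</highlight></codeline>
<codeline></codeline>
<codeline><highlight class="normal">#<sp/>4.<sp/>link<sp/>everything<sp/>with<sp/>the<sp/>host<sp/>linker<sp/>and<sp/>the<sp/>CUDA<sp/>runtime</highlight></codeline>
<codeline><highlight class="normal">~$<sp/>g++<sp/>-pthread<sp/>gpuCode.o<sp/>main.o<sp/>cudaflow.o<sp/>\</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/>-L<sp/>/usr/local/cuda/lib64/<sp/>-lcudart<sp/>-o<sp/>main</highlight></codeline>
</programlisting></para>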
</sect2>
</sect1>
</detaileddescription>
<location file="doxygen/install/cuda_compile.dox"/>
</compounddef>
</doxygen>