forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathForEachCUDA.xml
More file actions
44 lines (44 loc) · 7.17 KB
/
Copy pathForEachCUDA.xml
File metadata and controls
44 lines (44 loc) · 7.17 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.8.14">
<compounddef id="ForEachCUDA" kind="page">
<compoundname>ForEachCUDA</compoundname>
<title>Parallel Iterations</title>
<tableofcontents/>
<briefdescription>
</briefdescription>
<detaileddescription>
<para><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref> provides two template methods, <ref refid="classtf_1_1cudaFlow_1a1a681f6223853b6445dcfdad07e4d0fd" kindref="member">tf::cudaFlow::for_each</ref> and <ref refid="classtf_1_1cudaFlow_1a34f1ea89e5651faa6e8af522a42556ac" kindref="member">tf::cudaFlow::for_each_index</ref>, for creating tasks to perform parallel iterations over a range of items.</para><sect1 id="ForEachCUDA_1CUDAForEachIncludeTheHeader">
<title>Include the Header</title>
<para>You need to include the header file, <computeroutput>taskflow/cuda/algorithm/for_each.hpp</computeroutput>, for creating a parallel-iteration task.</para><para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/><<ref refid="for__each_8hpp" kindref="compound">taskflow/cuda/algorithm/for_each.hpp</ref>></highlight></codeline>
</programlisting></para></sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAIndexBasedParallelFor">
<title>Index-based Parallel Iterations</title>
<para>Index-based parallel-for performs parallel iterations over a range <computeroutput>[first, last)</computeroutput> with the given <computeroutput>step</computeroutput> size. The task created by <ref refid="classtf_1_1cudaFlow_1a34f1ea89e5651faa6e8af522a42556ac" kindref="member">tf::cudaFlow::for_each_index(I first, I last, I step, C callable)</ref> represents a kernel of parallel execution for the following loop:</para><para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>positive<sp/>step:<sp/>first,<sp/>first+step,<sp/>first+2*step,<sp/>...</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i<last;<sp/>i+=step)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>negative<sp/>step:<sp/>first,<sp/>first-step,<sp/>first-2*step,<sp/>...</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i>last;<sp/>i+=step)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para><para>Each iteration <computeroutput>i</computeroutput> is independent of each other and is assigned one kernel thread to run the callable. Since the callable runs on GPU, it must be declared with a <computeroutput>__device__</computeroutput> specifier. The following example creates a kernel that assigns each entry of <computeroutput>gpu_data</computeroutput> to 1 over the range <computeroutput></computeroutput>[0, 100) with step size 1.</para><para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>assigns<sp/>each<sp/>element<sp/>in<sp/>gpu_data<sp/>to<sp/>1<sp/>over<sp/>the<sp/>range<sp/>[0,<sp/>100)<sp/>with<sp/>step<sp/>size<sp/>1</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaflow.<ref refid="classtf_1_1cudaFlow_1a34f1ea89e5651faa6e8af522a42556ac" kindref="member">for_each_index</ref>(0,<sp/>100,<sp/>1,<sp/>[gpu_data]<sp/>__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>idx)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>gpu_data[idx]<sp/>=<sp/>1;</highlight></codeline>
<codeline><highlight class="normal">});</highlight></codeline>
</programlisting></para></sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAIteratorBasedParallelIterations">
<title>Iterator-based Parallel Iterations</title>
<para>Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, <computeroutput>first</computeroutput> and <computeroutput>last</computeroutput>. The task created by <ref refid="classtf_1_1cudaFlow_1a1a681f6223853b6445dcfdad07e4d0fd" kindref="member">tf::cudaFlow::for_each(I first, I last, C callable)</ref> represents a parallel execution of the following loop:</para><para><programlisting filename=".cpp"><codeline><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i<last;<sp/>i++)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(*i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para><para>The two iterators, <computeroutput>first</computeroutput> and <computeroutput>last</computeroutput>, are typically two raw pointers to the first element and the next to the last element in the range in GPU memory space. The following example creates a <computeroutput>for_each</computeroutput> kernel that assigns each element in <computeroutput>gpu_data</computeroutput> to 1 over the range <computeroutput>[gpu_data, gpu_data + 1000)</computeroutput>.</para><para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>assigns<sp/>each<sp/>element<sp/>to<sp/>1<sp/>over<sp/>the<sp/>range<sp/>[gpu_data,<sp/>gpu_data<sp/>+<sp/>1000)</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaflow.<ref refid="classtf_1_1cudaFlow_1a1a681f6223853b6445dcfdad07e4d0fd" kindref="member">for_each</ref>(gpu_data,<sp/>gpu_data<sp/>+<sp/>1000,<sp/>[]<sp/>__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">&<sp/>item)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>item<sp/>=<sp/>1;</highlight></codeline>
<codeline><highlight class="normal">});<sp/></highlight></codeline>
</programlisting></para><para>Each iteration is independent of each other and is assigned one kernel thread to run the callable. Since the callable runs on GPU, it must be declared with a <computeroutput>__device__</computeroutput> specifier.</para></sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAMiscellaneousItems">
<title>Miscellaneous Items</title>
<para>The parallel-iteration algorithms are also available in <ref refid="classtf_1_1cudaFlowCapturer_1a0b2f1bcd59f0b42e0f823818348b4ae7" kindref="member">tf::cudaFlowCapturer::for_each</ref> and <ref refid="classtf_1_1cudaFlowCapturer_1aeb877f42ee3a627c40f1c9c84e31ba3c" kindref="member">tf::cudaFlowCapturer::for_each_index</ref>. </para></sect1>
</detaileddescription>
</compounddef>
</doxygen>