forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathCUDASTDForEach.html
More file actions
138 lines (134 loc) · 14.6 KB
/
Copy pathCUDASTDForEach.html
File metadata and controls
138 lines (134 loc) · 14.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>CUDA Standard Algorithms » Parallel Iterations | Taskflow QuickStart</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
<link rel="stylesheet" href="m-dark+documentation.compiled.css" />
<link rel="icon" href="favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="theme-color" content="#22272e" />
</head>
<body>
<header><nav id="navigation">
<div class="m-container">
<div class="m-row">
<span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
<a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
</span>
<div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
<a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
</svg></a>
<a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
<a id="m-navbar-hide" href="#" title="Hide navigation"></a>
</div>
<div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
<div class="m-row">
<ol class="m-col-t-6 m-col-m-none">
<li><a href="pages.html">Handbook</a></li>
<li><a href="namespaces.html">Namespaces</a></li>
</ol>
<ol class="m-col-t-6 m-col-m-none" start="3">
<li><a href="annotated.html">Classes</a></li>
<li><a href="files.html">Files</a></li>
<li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<use href="#m-doc-search-icon-path" />
</svg></a></li>
</ol>
</div>
</div>
</div>
</div>
</nav></header>
<main><article>
<div class="m-container m-container-inflatable">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<h1>
<span class="m-breadcrumb"><a href="cudaStandardAlgorithms.html">CUDA Standard Algorithms</a> »</span>
Parallel Iterations
</h1>
<nav class="m-block m-default">
<h3>Contents</h3>
<ul>
<li><a href="#CUDASTDParallelIterationIncludeTheHeader">Include the Header</a></li>
<li><a href="#CUDASTDIndexBasedParallelFor">Index-based Parallel Iterations</a></li>
<li><a href="#CUDASTDIteratorBasedParallelFor">Iterator-based Parallel Iterations</a></li>
</ul>
</nav>
<p>Taskflow provides standard template methods for performing parallel iterations over a range of items a CUDA GPU.</p><section id="CUDASTDParallelIterationIncludeTheHeader"><h2><a href="#CUDASTDParallelIterationIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/algorithm/for_each.hpp</code>, for using the parallel-iteration algorithm.</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf"><taskflow/cuda/algorithm/for_each.hpp></span><span class="cp"></span></pre></section><section id="CUDASTDIndexBasedParallelFor"><h2><a href="#CUDASTDIndexBasedParallelFor">Index-based Parallel Iterations</a></h2><p>Index-based parallel-for performs parallel iterations over a range <code>[first, last)</code> with the given <code>step</code> size. The task created by <a href="namespacetf.html#a01ad7ce62fa6f42f2f2fbff3659b7884" class="m-doc">tf::<wbr />cuda_for_each_index</a> represents a kernel of parallel execution for the following loop:</p><pre class="m-code"><span class="c1">// positive step: first, first+step, first+2*step, ...</span>
<span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">+=</span><span class="n">step</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="n">i</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
<span class="c1">// negative step: first, first-step, first-2*step, ...</span>
<span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">></span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">+=</span><span class="n">step</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="n">i</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span></pre><p>Each iteration <code>i</code> is independent of each other and is assigned one kernel thread to run the callable. The following example creates a kernel that assigns each entry of <code>data</code> to 1 over the range [0, 100) with step size 1.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="n">policy</span><span class="p">;</span><span class="w"></span>
<span class="k">auto</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="mi">100</span><span class="p">);</span><span class="w"></span>
<span class="c1">// assigns each element in data to 1 over the range [0, 100) with step size 1</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_for_each_index</span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="n">data</span><span class="p">]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">idx</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
<span class="c1">// synchronize the execution</span>
<span class="n">policy</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span><span class="w"></span></pre><p>The parallel-iteration algorithm runs <em>asynchronously</em> through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results.</p></section><section id="CUDASTDIteratorBasedParallelFor"><h2><a href="#CUDASTDIteratorBasedParallelFor">Iterator-based Parallel Iterations</a></h2><p>Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, <code>first</code> and <code>last</code>. The task created by <a href="namespacetf.html#a7c449cec0b93503b8280d05add35e9f4" class="m-doc">tf::<wbr />cuda_for_each</a> represents a parallel execution of the following loop:</p><pre class="m-code"><span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="o">*</span><span class="n">i</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span></pre><p>The two iterators, <code>first</code> and <code>last</code>, are typically two raw pointers to the first element and the next to the last element in the range in GPU memory space. The following example creates a <code>for_each</code> kernel that assigns each element in <code>gpu_data</code> to 1 over the range <code>[data, data + 1000)</code>.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="n">policy</span><span class="p">;</span><span class="w"></span>
<span class="k">auto</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="mi">1000</span><span class="p">);</span><span class="w"></span>
<span class="c1">// assigns each element in data to 1 over the range [0, 1000) with step size 1</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_for_each</span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="o">&</span><span class="w"> </span><span class="n">item</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
<span class="c1">// synchronize the execution</span>
<span class="n">policy</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span><span class="w"></span></pre><p>Each iteration is independent of each other and is assigned one kernel thread to run the callable. Since the callable runs on GPU, it must be declared with a <code>__device__</code> specifier.</p></section>
</div>
</div>
</div>
</article></main>
<div class="m-doc-search" id="search">
<a href="#!" onclick="return hideSearch()"></a>
<div class="m-container">
<div class="m-row">
<div class="m-col-m-8 m-push-m-2">
<div class="m-doc-search-header m-text m-small">
<div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
<div id="search-symbolcount">…</div>
</div>
<div class="m-doc-search-content">
<form>
<input type="search" name="q" id="search-input" placeholder="Loading …" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
</form>
<noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
<div id="search-help" class="m-text m-dim m-text-center">
<p class="m-noindent">Search for symbols, directories, files, pages or
modules. You can omit any prefix from the symbol or file path; adding a
<code>:</code> or <code>/</code> suffix lists all members of given symbol or
directory.</p>
<p class="m-noindent">Use <span class="m-label m-dim">↓</span>
/ <span class="m-label m-dim">↑</span> to navigate through the list,
<span class="m-label m-dim">Enter</span> to go.
<span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
copy a link to the result using <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">M</span> produces a Markdown link.</p>
</div>
<div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
<ul id="search-results"></ul>
</div>
</div>
</div>
</div>
</div>
<script src="search-v2.js"></script>
<script src="searchdata-v2.js" async="async"></script>
<footer><nav>
<div class="m-container">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018–2023.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.8.14 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
</div>
</div>
</div>
</nav></footer>
</body>
</html>