forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathCUDASTDFind.html
More file actions
219 lines (199 loc) · 30.5 KB
/
Copy pathCUDASTDFind.html
File metadata and controls
219 lines (199 loc) · 30.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>CUDA Standard Algorithms » Parallel Find | Taskflow QuickStart</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
<link rel="stylesheet" href="m-dark+documentation.compiled.css" />
<link rel="icon" href="favicon.ico" type="image/x-icon" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="theme-color" content="#22272e" />
</head>
<body>
<header><nav id="navigation">
<div class="m-container">
<div class="m-row">
<span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
<a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
</span>
<div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
<a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
</svg></a>
<a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
<a id="m-navbar-hide" href="#" title="Hide navigation"></a>
</div>
<div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
<div class="m-row">
<ol class="m-col-t-6 m-col-m-none">
<li><a href="pages.html">Handbook</a></li>
<li><a href="namespaces.html">Namespaces</a></li>
</ol>
<ol class="m-col-t-6 m-col-m-none" start="3">
<li><a href="annotated.html">Classes</a></li>
<li><a href="files.html">Files</a></li>
<li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<use href="#m-doc-search-icon-path" />
</svg></a></li>
</ol>
</div>
</div>
</div>
</div>
</nav></header>
<main><article>
<div class="m-container m-container-inflatable">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<h1>
<span class="m-breadcrumb"><a href="cudaStandardAlgorithms.html">CUDA Standard Algorithms</a> »</span>
Parallel Find
</h1>
<nav class="m-block m-default">
<h3>Contents</h3>
<ul>
<li><a href="#CUDASTDFindIncludeTheHeader">Include the Header</a></li>
<li><a href="#CUDASTDFindItems">Find an Element in a Range</a></li>
<li><a href="#CUDASTDFindMinItems">Find the Minimum Element in a Range</a></li>
<li><a href="#CUDASTDFindMaxItems">Find the Maximum Element in a Range</a></li>
</ul>
</nav>
<p>Taskflow provides standalone template methods for finding elements in the given ranges using GPU.</p><section id="CUDASTDFindIncludeTheHeader"><h2><a href="#CUDASTDFindIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/algorithm/find.hpp</code>, for using the parallel-find algorithm.</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf"><taskflow/cuda/algorithm/find.hpp></span></pre></section><section id="CUDASTDFindItems"><h2><a href="#CUDASTDFindItems">Find an Element in a Range</a></h2><p><a href="namespacetf.html#a5f9dabd7c5d0fa5166cf76d9fa5a038e" class="m-doc">tf::<wbr />cuda_find_if</a> finds the index of the first element in the range <code>[first, last)</code> that satisfies the given criteria. This is equivalent to the parallel execution of the following loop:</p><pre class="m-code"><span class="kt">unsigned</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="k">for</span><span class="p">(;</span><span class="w"> </span><span class="n">first</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">first</span><span class="p">,</span><span class="w"> </span><span class="o">++</span><span class="n">idx</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">p</span><span class="p">(</span><span class="o">*</span><span class="n">first</span><span class="p">))</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">idx</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span><span class="w"> </span><span class="n">idx</span><span class="p">;</span></pre><p>If no such an element is found, the size of the range is returned. The following code finds the index of the first element that is dividable by <code>17</code> over a range of one million elements.</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="k">auto</span><span class="w"> </span><span class="n">vec</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w"> </span><span class="c1">// vector</span>
<span class="k">auto</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">unsigned</span><span class="o">></span><span class="p">(</span><span class="mi">1</span><span class="p">);</span><span class="w"> </span><span class="c1">// index</span>
<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">vec</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span>
<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="n">policy</span><span class="p">;</span>
<span class="c1">// finds the index of the first element that is a multiple of 17</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_find_if</span><span class="p">(</span>
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="o">+</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">idx</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">v</span><span class="o">%</span><span class="mi">17</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<span class="p">);</span>
<span class="c1">// wait for the find operation to complete</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="c1">// verifies the result</span>
<span class="k">if</span><span class="p">(</span><span class="o">*</span><span class="n">idx</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">assert</span><span class="p">(</span><span class="n">vec</span><span class="p">[</span><span class="o">*</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">%</span><span class="mi">17</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// deletes the memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">vec</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span></pre><p>The find-if algorithm runs <em>asynchronously</em> through the stream specified in the execution policy. You need to synchronize the stream to obtain the correct result.</p></section><section id="CUDASTDFindMinItems"><h2><a href="#CUDASTDFindMinItems">Find the Minimum Element in a Range</a></h2><p><a href="namespacetf.html#a572c13198191c46765264f8afabe2e9f" class="m-doc">tf::<wbr />cuda_min_element</a> finds the index of the minimum element in the given range <code>[first, last)</code> using the given comparison function object. This is equivalent to a parallel execution of the following loop:</p><pre class="m-code"><span class="k">if</span><span class="p">(</span><span class="n">first</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">last</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">auto</span><span class="w"> </span><span class="n">smallest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">;</span>
<span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="o">++</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">first</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">first</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">op</span><span class="p">(</span><span class="o">*</span><span class="n">first</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">smallest</span><span class="p">))</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">smallest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">distance</span><span class="p">(</span><span class="n">first</span><span class="p">,</span><span class="w"> </span><span class="n">smallest</span><span class="p">);</span></pre><p>The following code finds the index of the minimum element in a range of one millions elements using GPU computing:</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="k">auto</span><span class="w"> </span><span class="n">vec</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w"> </span><span class="c1">// vector</span>
<span class="k">auto</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">unsigned</span><span class="o">></span><span class="p">(</span><span class="mi">1</span><span class="p">);</span><span class="w"> </span><span class="c1">// index</span>
<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">vec</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span>
<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="nf">policy</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="c1">// queries the required buffer size to find the minimum element over N element</span>
<span class="k">auto</span><span class="w"> </span><span class="n">bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">policy</span><span class="p">.</span><span class="n">min_element_bufsz</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">N</span><span class="p">);</span>
<span class="k">auto</span><span class="w"> </span><span class="n">buffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_device</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">byte</span><span class="o">></span><span class="p">(</span><span class="n">bytes</span><span class="p">);</span>
<span class="c1">// finds the minimum element using the less comparator</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_min_element</span><span class="p">(</span>
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="o">+</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">idx</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="o"><</span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">buffer</span>
<span class="p">);</span>
<span class="c1">// wait for the min-element operation completes</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="c1">// verifies the result</span>
<span class="n">assert</span><span class="p">(</span><span class="n">vec</span><span class="p">[</span><span class="o">*</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="o">*</span><span class="n">std</span><span class="o">::</span><span class="n">min_element</span><span class="p">(</span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="o">+</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">less</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">{}));</span>
<span class="c1">// deletes the memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">vec</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span></pre><p>Since the GPU min-element algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <code><a href="classtf_1_1cudaExecutionPolicy.html#abcafb001cd68c1135392f4bcda5a2a05" class="m-doc">tf::<wbr />cudaDefaultExecutionPolicy::<wbr />min_element_bufsz</a></code>.</p><aside class="m-note m-warning"><h4>Attention</h4><p>You must keep the buffer alive before the <a href="namespacetf.html#a572c13198191c46765264f8afabe2e9f" class="m-doc">tf::<wbr />cuda_min_element</a> completes.</p></aside></section><section id="CUDASTDFindMaxItems"><h2><a href="#CUDASTDFindMaxItems">Find the Maximum Element in a Range</a></h2><p>Similar to <a href="namespacetf.html#a572c13198191c46765264f8afabe2e9f" class="m-doc">tf::<wbr />cuda_min_element</a>, <a href="namespacetf.html#a3fc577fd0a8f127770bcf68bc56c073e" class="m-doc">tf::<wbr />cuda_max_element</a> finds the index of the maximum element in the given range <code>[first, last)</code> using the given comparison function object. This is equivalent to a parallel execution of the following loop:</p><pre class="m-code"><span class="k">if</span><span class="p">(</span><span class="n">first</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">last</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">auto</span><span class="w"> </span><span class="n">largest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">;</span>
<span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="o">++</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">first</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">first</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">op</span><span class="p">(</span><span class="o">*</span><span class="n">largest</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="n">first</span><span class="p">))</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">largest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">distance</span><span class="p">(</span><span class="n">first</span><span class="p">,</span><span class="w"> </span><span class="n">largest</span><span class="p">);</span></pre><p>The following code finds the index of the maximum element in a range of one millions elements using GPU computing:</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="k">auto</span><span class="w"> </span><span class="n">vec</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w"> </span><span class="c1">// vector</span>
<span class="k">auto</span><span class="w"> </span><span class="n">idx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">unsigned</span><span class="o">></span><span class="p">(</span><span class="mi">1</span><span class="p">);</span><span class="w"> </span><span class="c1">// index</span>
<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">vec</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span>
<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="nf">policy</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="c1">// queries the required buffer size to find the maximum element over N element</span>
<span class="k">auto</span><span class="w"> </span><span class="n">bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">policy</span><span class="p">.</span><span class="n">max_element_bufsz</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">N</span><span class="p">);</span>
<span class="k">auto</span><span class="w"> </span><span class="n">buffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_device</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">byte</span><span class="o">></span><span class="p">(</span><span class="n">bytes</span><span class="p">);</span>
<span class="c1">// finds the maximum element using the less comparator</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_max_element</span><span class="p">(</span>
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="o">+</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">idx</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="o"><</span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">buffer</span>
<span class="p">);</span>
<span class="c1">// wait for the max-element operation to complete</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="c1">// verifies the result</span>
<span class="n">assert</span><span class="p">(</span><span class="n">vec</span><span class="p">[</span><span class="o">*</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="o">*</span><span class="n">std</span><span class="o">::</span><span class="n">max_element</span><span class="p">(</span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">vec</span><span class="o">+</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">less</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">{}));</span>
<span class="c1">// deletes the memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">vec</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span></pre><p>Since the GPU max-element algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <code><a href="classtf_1_1cudaExecutionPolicy.html#a31fe75c4b0765df3035e12be49af88aa" class="m-doc">tf::<wbr />cudaDefaultExecutionPolicy::<wbr />max_element_bufsz</a></code>.</p><aside class="m-note m-warning"><h4>Attention</h4><p>You must keep the buffer alive before <a href="namespacetf.html#a3fc577fd0a8f127770bcf68bc56c073e" class="m-doc">tf::<wbr />cuda_max_element</a> completes.</p></aside></section>
</div>
</div>
</div>
</article></main>
<div class="m-doc-search" id="search">
<a href="#!" onclick="return hideSearch()"></a>
<div class="m-container">
<div class="m-row">
<div class="m-col-m-8 m-push-m-2">
<div class="m-doc-search-header m-text m-small">
<div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
<div id="search-symbolcount">…</div>
</div>
<div class="m-doc-search-content">
<form>
<input type="search" name="q" id="search-input" placeholder="Loading …" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
</form>
<noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
<div id="search-help" class="m-text m-dim m-text-center">
<p class="m-noindent">Search for symbols, directories, files, pages or
modules. You can omit any prefix from the symbol or file path; adding a
<code>:</code> or <code>/</code> suffix lists all members of given symbol or
directory.</p>
<p class="m-noindent">Use <span class="m-label m-dim">↓</span>
/ <span class="m-label m-dim">↑</span> to navigate through the list,
<span class="m-label m-dim">Enter</span> to go.
<span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
copy a link to the result using <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">M</span> produces a Markdown link.</p>
</div>
<div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
<ul id="search-results"></ul>
</div>
</div>
</div>
</div>
</div>
<script src="search-v2.js"></script>
<script src="searchdata-v2.js" async="async"></script>
<footer><nav>
<div class="m-container">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018–2025.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.12.0 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
</div>
</div>
</div>
</nav></footer>
</body>
</html>