forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathCUDASTDFind.xml
More file actions
176 lines (176 loc) · 24.2 KB
/
Copy pathCUDASTDFind.xml
File metadata and controls
176 lines (176 loc) · 24.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.12.0" xml:lang="en-US">
<compounddef id="CUDASTDFind" kind="page">
<compoundname>CUDASTDFind</compoundname>
<title>Parallel Find</title>
<tableofcontents>
<tocsect>
<name>Include the Header</name>
<reference>CUDASTDFind_1CUDASTDFindIncludeTheHeader</reference>
</tocsect>
<tocsect>
<name>Find an Element in a Range</name>
<reference>CUDASTDFind_1CUDASTDFindItems</reference>
</tocsect>
<tocsect>
<name>Find the Minimum Element in a Range</name>
<reference>CUDASTDFind_1CUDASTDFindMinItems</reference>
</tocsect>
<tocsect>
<name>Find the Maximum Element in a Range</name>
<reference>CUDASTDFind_1CUDASTDFindMaxItems</reference>
</tocsect>
</tableofcontents>
<briefdescription>
</briefdescription>
<detaileddescription>
<para>Taskflow provides standalone template methods for finding elements in the given ranges using GPU.</para>
<sect1 id="CUDASTDFind_1CUDASTDFindIncludeTheHeader">
<title>Include the Header</title><para>You need to include the header file, <computeroutput>taskflow/cuda/algorithm/find.hpp</computeroutput>, for using the parallel-find algorithm.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/><<ref refid="find_8hpp" kindref="compound">taskflow/cuda/algorithm/find.hpp</ref>></highlight></codeline>
</programlisting></para>
</sect1>
<sect1 id="CUDASTDFind_1CUDASTDFindItems">
<title>Find an Element in a Range</title><para><ref refid="namespacetf_1a5f9dabd7c5d0fa5166cf76d9fa5a038e" kindref="member">tf::cuda_find_if</ref> finds the index of the first element in the range <computeroutput>[first, last)</computeroutput> that satisfies the given criteria. This is equivalent to the parallel execution of the following loop:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keywordtype">unsigned</highlight><highlight class="normal"><sp/>idx<sp/>=<sp/>0;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(;<sp/>first<sp/>!=<sp/>last;<sp/>++first,<sp/>++idx)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">if</highlight><highlight class="normal"><sp/>(p(*first))<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>idx;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>idx;</highlight></codeline>
</programlisting></para>
<para>If no such an element is found, the size of the range is returned. The following code finds the index of the first element that is dividable by <computeroutput>17</computeroutput> over a range of one million elements.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>1000000;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>vec<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);<sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>vector</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>idx<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<unsigned></ref>(1);<sp/><sp/></highlight><highlight class="comment">//<sp/>index</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>initializes<sp/>the<sp/>data</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>i=0;<sp/>i<N;<sp/>vec[i++]<sp/>=<sp/>rand());</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>execution<sp/>policy</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaExecutionPolicy" kindref="compound">tf::cudaDefaultExecutionPolicy</ref><sp/>policy;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>finds<sp/>the<sp/>index<sp/>of<sp/>the<sp/>first<sp/>element<sp/>that<sp/>is<sp/>a<sp/>multiple<sp/>of<sp/>17</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="namespacetf_1a5f9dabd7c5d0fa5166cf76d9fa5a038e" kindref="member">tf::cuda_find_if</ref>(</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>policy,<sp/>vec,<sp/>vec+N,<sp/>idx,<sp/>[]<sp/>__device__<sp/>(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>v)<sp/>{<sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>v%17<sp/>==<sp/>0;<sp/>}</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>wait<sp/>for<sp/>the<sp/>find<sp/>operation<sp/>to<sp/>complete</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.synchronize();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>verifies<sp/>the<sp/>result</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">if</highlight><highlight class="normal">(*idx<sp/>!=<sp/>N)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>assert(vec[*idx]<sp/>%17<sp/>==<sp/>0);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>deletes<sp/>the<sp/>memory</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaFree(vec);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(idx);</highlight></codeline>
</programlisting></para>
<para>The find-if algorithm runs <emphasis>asynchronously</emphasis> through the stream specified in the execution policy. You need to synchronize the stream to obtain the correct result.</para>
</sect1>
<sect1 id="CUDASTDFind_1CUDASTDFindMinItems">
<title>Find the Minimum Element in a Range</title><para><ref refid="namespacetf_1a572c13198191c46765264f8afabe2e9f" kindref="member">tf::cuda_min_element</ref> finds the index of the minimum element in the given range <computeroutput>[first, last)</computeroutput> using the given comparison function object. This is equivalent to a parallel execution of the following loop:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keywordflow">if</highlight><highlight class="normal">(first<sp/>==<sp/>last)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>0;</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>smallest<sp/>=<sp/>first;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal"><sp/>(++first;<sp/>first<sp/>!=<sp/>last;<sp/>++first)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">if</highlight><highlight class="normal"><sp/>(op(*first,<sp/>*smallest))<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>smallest<sp/>=<sp/>first;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/><ref refid="cpp/iterator/distance" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::distance</ref>(first,<sp/>smallest);</highlight></codeline>
</programlisting></para>
<para>The following code finds the index of the minimum element in a range of one millions elements using GPU computing:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>1000000;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>vec<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);<sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>vector</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>idx<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<unsigned></ref>(1);<sp/><sp/></highlight><highlight class="comment">//<sp/>index</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>initializes<sp/>the<sp/>data</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>i=0;<sp/>i<N;<sp/>vec[i++]<sp/>=<sp/>rand());</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>execution<sp/>policy</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaExecutionPolicy" kindref="compound">tf::cudaDefaultExecutionPolicy</ref><sp/>policy(stream);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>queries<sp/>the<sp/>required<sp/>buffer<sp/>size<sp/>to<sp/>find<sp/>the<sp/>minimum<sp/>element<sp/>over<sp/>N<sp/>element</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>bytes<sp/><sp/>=<sp/>policy.<ref refid="classtf_1_1cudaExecutionPolicy_1abcafb001cd68c1135392f4bcda5a2a05" kindref="member">min_element_bufsz</ref><</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>buffer<sp/>=<sp/><ref refid="namespacetf_1a2548e58af071bf1dbbbc945c84f237c9" kindref="member">tf::cuda_malloc_device<std::byte></ref>(bytes);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>finds<sp/>the<sp/>minimum<sp/>element<sp/>using<sp/>the<sp/>less<sp/>comparator</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="namespacetf_1a572c13198191c46765264f8afabe2e9f" kindref="member">tf::cuda_min_element</ref>(</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>policy,<sp/>vec,<sp/>vec+N,<sp/>idx,<sp/>[]<sp/>__device__<sp/>(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>a,<sp/></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>b)<sp/>{<sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>a<b;<sp/>},<sp/>buffer</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>wait<sp/>for<sp/>the<sp/>min-element<sp/>operation<sp/>completes</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>verifies<sp/>the<sp/>result</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">assert(vec[*idx]<sp/>==<sp/>*<ref refid="cpp/algorithm/min_element" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::min_element</ref>(vec,<sp/>vec+N,<sp/><ref refid="cpp/utility/functional/less" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::less<int></ref>{}));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>deletes<sp/>the<sp/>memory</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaFree(vec);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(idx);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(buffer);</highlight></codeline>
</programlisting></para>
<para>Since the GPU min-element algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <computeroutput><ref refid="classtf_1_1cudaExecutionPolicy_1abcafb001cd68c1135392f4bcda5a2a05" kindref="member">tf::cudaDefaultExecutionPolicy::min_element_bufsz</ref></computeroutput>.</para>
<para><simplesect kind="attention"><para>You must keep the buffer alive before the <ref refid="namespacetf_1a572c13198191c46765264f8afabe2e9f" kindref="member">tf::cuda_min_element</ref> completes.</para>
</simplesect>
</para>
</sect1>
<sect1 id="CUDASTDFind_1CUDASTDFindMaxItems">
<title>Find the Maximum Element in a Range</title><para>Similar to <ref refid="namespacetf_1a572c13198191c46765264f8afabe2e9f" kindref="member">tf::cuda_min_element</ref>, <ref refid="namespacetf_1a3fc577fd0a8f127770bcf68bc56c073e" kindref="member">tf::cuda_max_element</ref> finds the index of the maximum element in the given range <computeroutput>[first, last)</computeroutput> using the given comparison function object. This is equivalent to a parallel execution of the following loop:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keywordflow">if</highlight><highlight class="normal">(first<sp/>==<sp/>last)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>0;</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>largest<sp/>=<sp/>first;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal"><sp/>(++first;<sp/>first<sp/>!=<sp/>last;<sp/>++first)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/></highlight><highlight class="keywordflow">if</highlight><highlight class="normal"><sp/>(op(*largest,<sp/>*first))<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/><sp/><sp/>largest<sp/>=<sp/>first;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>}</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/><ref refid="cpp/iterator/distance" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::distance</ref>(first,<sp/>largest);</highlight></codeline>
</programlisting></para>
<para>The following code finds the index of the maximum element in a range of one millions elements using GPU computing:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>1000000;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>vec<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);<sp/><sp/><sp/><sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>vector</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>idx<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<unsigned></ref>(1);<sp/><sp/></highlight><highlight class="comment">//<sp/>index</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>initializes<sp/>the<sp/>data</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>i=0;<sp/>i<N;<sp/>vec[i++]<sp/>=<sp/>rand());</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>execution<sp/>policy</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaExecutionPolicy" kindref="compound">tf::cudaDefaultExecutionPolicy</ref><sp/>policy(stream);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>queries<sp/>the<sp/>required<sp/>buffer<sp/>size<sp/>to<sp/>find<sp/>the<sp/>maximum<sp/>element<sp/>over<sp/>N<sp/>element</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>bytes<sp/><sp/>=<sp/>policy.<ref refid="classtf_1_1cudaExecutionPolicy_1a31fe75c4b0765df3035e12be49af88aa" kindref="member">max_element_bufsz</ref><</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>buffer<sp/>=<sp/><ref refid="namespacetf_1a2548e58af071bf1dbbbc945c84f237c9" kindref="member">tf::cuda_malloc_device<std::byte></ref>(bytes);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>finds<sp/>the<sp/>maximum<sp/>element<sp/>using<sp/>the<sp/>less<sp/>comparator</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="namespacetf_1a3fc577fd0a8f127770bcf68bc56c073e" kindref="member">tf::cuda_max_element</ref>(</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>policy,<sp/>vec,<sp/>vec+N,<sp/>idx,<sp/>[]<sp/>__device__<sp/>(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>a,<sp/></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>b)<sp/>{<sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>a<b;<sp/>},<sp/>buffer</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>wait<sp/>for<sp/>the<sp/>max-element<sp/>operation<sp/>to<sp/>complete</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>verifies<sp/>the<sp/>result</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">assert(vec[*idx]<sp/>==<sp/>*<ref refid="cpp/algorithm/max_element" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::max_element</ref>(vec,<sp/>vec+N,<sp/><ref refid="cpp/utility/functional/less" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::less<int></ref>{}));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>deletes<sp/>the<sp/>memory</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaFree(vec);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(idx);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(buffer);</highlight></codeline>
</programlisting></para>
<para>Since the GPU max-element algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <computeroutput><ref refid="classtf_1_1cudaExecutionPolicy_1a31fe75c4b0765df3035e12be49af88aa" kindref="member">tf::cudaDefaultExecutionPolicy::max_element_bufsz</ref></computeroutput>.</para>
<para><simplesect kind="attention"><para>You must keep the buffer alive before <ref refid="namespacetf_1a3fc577fd0a8f127770bcf68bc56c073e" kindref="member">tf::cuda_max_element</ref> completes. </para>
</simplesect>
</para>
</sect1>
</detaileddescription>
<location file="doxygen/cuda_std_algorithms/cuda_std_find.dox"/>
</compounddef>
</doxygen>