forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathCUDASTDMerge.xml
More file actions
130 lines (130 loc) · 17.6 KB
/
Copy pathCUDASTDMerge.xml
File metadata and controls
130 lines (130 loc) · 17.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.12.0" xml:lang="en-US">
<compounddef id="CUDASTDMerge" kind="page">
<compoundname>CUDASTDMerge</compoundname>
<title>Parallel Merge</title>
<tableofcontents>
<tocsect>
<name>Include the Header</name>
<reference>CUDASTDMerge_1CUDASTDMergeIncludeTheHeader</reference>
</tocsect>
<tocsect>
<name>Merge Two Sorted Ranges of Items</name>
<reference>CUDASTDMerge_1CUDASTDMergeItems</reference>
</tocsect>
<tocsect>
<name>Merge Two Sorted Ranges of Key-Value Items</name>
<reference>CUDASTDMerge_1CUDASTDMergeKeyValueItems</reference>
</tocsect>
</tableofcontents>
<briefdescription>
</briefdescription>
<detaileddescription>
<para>Taskflow provides standalone template methods for merging two sorted ranges of items into a sorted range of items.</para>
<sect1 id="CUDASTDMerge_1CUDASTDMergeIncludeTheHeader">
<title>Include the Header</title><para>You need to include the header file, <computeroutput>taskflow/cuda/algorithm/merge.hpp</computeroutput>, for using the parallel-merge algorithm.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/><<ref refid="merge_8hpp" kindref="compound">taskflow/cuda/algorithm/merge.hpp</ref>></highlight></codeline>
</programlisting></para>
</sect1>
<sect1 id="CUDASTDMerge_1CUDASTDMergeItems">
<title>Merge Two Sorted Ranges of Items</title><para><ref refid="namespacetf_1a37ec481149c2f01669353033d75ed72a" kindref="member">tf::cuda_merge</ref> merges two sorted ranges of items into a sorted range. The following code merges two sorted arrays <computeroutput>input_1</computeroutput> and <computeroutput>input_2</computeroutput>, each of 1000 items, into a sorted array <computeroutput>output</computeroutput> of 2000 items.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>1000;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>input_1<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);<sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>input<sp/>vector<sp/>1</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>input_2<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);<sp/><sp/><sp/><sp/></highlight><highlight class="comment">//<sp/>input<sp/>vector<sp/>2</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>output<sp/><sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(2*N);<sp/><sp/></highlight><highlight class="comment">//<sp/>output<sp/>vector</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>initializes<sp/>the<sp/>data</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>i=0;<sp/>i<N;<sp/>i++)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>input_1[i]<sp/>=<sp/>rand()%100;</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>input_2[i]<sp/>=<sp/>rand()%100;</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"><ref refid="cpp/algorithm/sort" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::sort</ref>(input_1,<sp/>input1<sp/>+<sp/>N);</highlight></codeline>
<codeline><highlight class="normal"><ref refid="cpp/algorithm/sort" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::sort</ref>(input_2,<sp/>input2<sp/>+<sp/>N);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>execution<sp/>policy</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaExecutionPolicy" kindref="compound">tf::cudaDefaultExecutionPolicy</ref><sp/>policy(stream);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>queries<sp/>the<sp/>required<sp/>buffer<sp/>size<sp/>to<sp/>merge<sp/>two<sp/>N-element<sp/>sorted<sp/>vectors</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>bytes<sp/><sp/>=<sp/>policy.merge_bufsz(N,<sp/>N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>buffer<sp/>=<sp/><ref refid="namespacetf_1a2548e58af071bf1dbbbc945c84f237c9" kindref="member">tf::cuda_malloc_device<std::byte></ref>(bytes);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>merge<sp/>input_1<sp/>and<sp/>input_2<sp/>to<sp/>output</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="namespacetf_1a37ec481149c2f01669353033d75ed72a" kindref="member">tf::cuda_merge</ref>(policy,<sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>input_1,<sp/>input_1<sp/>+<sp/>N,<sp/>input_2,<sp/>input_2<sp/>+<sp/>N,<sp/>output,<sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>[]__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>a,<sp/></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>b)<sp/>{<sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>a<sp/><<sp/>b;<sp/>},<sp/><sp/></highlight><highlight class="comment">//<sp/>comparator</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>buffer</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>synchronizes<sp/>the<sp/>execution<sp/>and<sp/>verifies<sp/>the<sp/>result</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>verify<sp/>the<sp/>result</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">assert(<ref refid="cpp/algorithm/is_sorted" kindref="compound" external="/Users/twhuang/Code/taskflow/doxygen/cppreference-doxygen-web.tag.xml">std::is_sorted</ref>(output,<sp/>output<sp/>+<sp/>2*N));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>delete<sp/>the<sp/>buffer</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaFree(input1);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(input2);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(output);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(buffer);</highlight></codeline>
</programlisting></para>
<para>The merge algorithm runs <emphasis>asynchronously</emphasis> through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU merge algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <computeroutput><ref refid="classtf_1_1cudaExecutionPolicy_1a1febbe549d9cbe4502a5b66167ab9553" kindref="member">tf::cudaDefaultExecutionPolicy::merge_bufsz</ref></computeroutput>. The buffer size depends only on the two input vector sizes.</para>
<para><simplesect kind="attention"><para>You must keep the buffer alive before the merge call completes.</para>
</simplesect>
</para>
</sect1>
<sect1 id="CUDASTDMerge_1CUDASTDMergeKeyValueItems">
<title>Merge Two Sorted Ranges of Key-Value Items</title><para><ref refid="namespacetf_1aa84d4c68d2cbe9f6efc4a1eb1a115458" kindref="member">tf::cuda_merge_by_key</ref> performs key-value merge over two sorted ranges in a similar way to <ref refid="namespacetf_1a37ec481149c2f01669353033d75ed72a" kindref="member">tf::cuda_merge</ref>; additionally, it copies elements from the two ranges of values associated with the two input keys, respectively. The following code performs key-value merge over <computeroutput>a</computeroutput> and <computeroutput>b:</computeroutput> </para>
<para><programlisting filename=".cpp"><codeline><highlight class="keyword">const</highlight><highlight class="normal"><sp/></highlight><highlight class="keywordtype">size_t</highlight><highlight class="normal"><sp/>N<sp/>=<sp/>2;</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>a_keys<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>a_vals<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>b_keys<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>b_vals<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>c_keys<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(2*N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordtype">int</highlight><highlight class="normal">*<sp/>c_vals<sp/>=<sp/><ref refid="namespacetf_1ad289846c38e3f122e1315d906243fc8b" kindref="member">tf::cuda_malloc_shared<int></ref>(2*N);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>initializes<sp/>the<sp/>data</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">a_keys[0]<sp/>=<sp/>8,<sp/>a_keys[1]<sp/>=<sp/>1;</highlight></codeline>
<codeline><highlight class="normal">a_vals[0]<sp/>=<sp/>1,<sp/>a_vals[1]<sp/>=<sp/>2;</highlight></codeline>
<codeline><highlight class="normal">b_keys[0]<sp/>=<sp/>3,<sp/>b_keys[1]<sp/>=<sp/>7;</highlight></codeline>
<codeline><highlight class="normal">b_vals[0]<sp/>=<sp/>3,<sp/>b_vals[1]<sp/>=<sp/>4;</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>create<sp/>an<sp/>execution<sp/>policy</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaStreamBase" kindref="compound">tf::cudaStream</ref><sp/>stream;</highlight></codeline>
<codeline><highlight class="normal"><ref refid="classtf_1_1cudaExecutionPolicy" kindref="compound">tf::cudaDefaultExecutionPolicy</ref><sp/>policy(stream);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>queries<sp/>the<sp/>required<sp/>buffer<sp/>size<sp/>to<sp/>merge<sp/>two<sp/>N-element<sp/>sorted<sp/>vectors<sp/>by<sp/>keys</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>bytes<sp/><sp/>=<sp/>policy.merge_bufsz(N,<sp/>N);</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>buffer<sp/>=<sp/><ref refid="namespacetf_1a2548e58af071bf1dbbbc945c84f237c9" kindref="member">tf::cuda_malloc_device<std::byte></ref>(bytes);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>merge<sp/>keys<sp/>and<sp/>values<sp/>of<sp/>a<sp/>and<sp/>b<sp/>to<sp/>c</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><ref refid="namespacetf_1aa84d4c68d2cbe9f6efc4a1eb1a115458" kindref="member">tf::cuda_merge_by_key</ref>(</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>policy,<sp/></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>a_keys,<sp/>a_keys+N,<sp/>a_vals,</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>b_keys,<sp/>b_keys+N,<sp/>b_vals,</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>c_keys,<sp/>c_vals,</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>[]__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>a,<sp/></highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>b)<sp/>{<sp/></highlight><highlight class="keywordflow">return</highlight><highlight class="normal"><sp/>a<sp/><<sp/>b;<sp/>},<sp/><sp/></highlight><highlight class="comment">//<sp/>comparator</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>buffer</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>wait<sp/>for<sp/>the<sp/>merge<sp/>to<sp/>complete</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">stream.<ref refid="classtf_1_1cudaStreamBase_1a08857ff2874cd5378e578822e2e96dd0" kindref="member">synchronize</ref>();</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>now,<sp/>c_keys<sp/>=<sp/>{1,<sp/>3,<sp/>7,<sp/>8}</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>now,<sp/>c_vals<sp/>=<sp/>{2,<sp/>3,<sp/>4,<sp/>1}</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>delete<sp/>the<sp/>device<sp/>memory</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaFree(buffer);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(a_keys);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(b_keys);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(c_keys);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(a_vals);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(b_vals);</highlight></codeline>
<codeline><highlight class="normal">cudaFree(c_vals);</highlight></codeline>
</programlisting></para>
<para>Since the GPU merge algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <computeroutput><ref refid="classtf_1_1cudaExecutionPolicy_1a1febbe549d9cbe4502a5b66167ab9553" kindref="member">tf::cudaDefaultExecutionPolicy::merge_bufsz</ref></computeroutput>. The buffer size depends only on the two input vector sizes. </para>
</sect1>
</detaileddescription>
<location file="doxygen/cuda_std_algorithms/cuda_std_merge.dox"/>
</compounddef>
</doxygen>