-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Expand file tree
/
Copy pathParallelMerge.html
More file actions
271 lines (269 loc) · 19.2 KB
/
Copy pathParallelMerge.html
File metadata and controls
271 lines (269 loc) · 19.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
<!-- HTML header for doxygen 1.13.1-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=11"/>
<meta name="generator" content="Doxygen 1.13.1"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<title>Taskflow: A General-purpose Task-parallel Programming System: Parallel Merge</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<script type="text/javascript" src="clipboard.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="cookie.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="custom.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr id="projectrow">
<td id="projectlogo"><img alt="Logo" src="taskflow_logo.png"/></td>
<td id="projectalign">
<div id="projectname"><a href="https://github.com/taskflow/taskflow" style="color:inherit; text-decoration:none;">Taskflow: A General-purpose Task-parallel Programming System</a>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.13.1 -->
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
var searchBox = new SearchBox("searchBox", "search/",'.html');
/* @license-end */
</script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() { codefold.init(0); });
/* @license-end */
</script>
<script type="text/javascript" src="menudata.js"></script>
<script type="text/javascript" src="menu.js"></script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() {
initMenu('',true,false,'search.php','Search',true);
$(function() { init_search(); });
});
/* @license-end */
</script>
<div id="main-nav"></div>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function(){initNavTree('ParallelMerge.html',''); initResizable(true); });
/* @license-end */
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<div id="MSearchResults">
<div class="SRPage">
<div id="SRIndex">
<div id="SRResults"></div>
<div class="SRStatus" id="Loading">Loading...</div>
<div class="SRStatus" id="Searching">Searching...</div>
<div class="SRStatus" id="NoMatches">No Matches</div>
</div>
</div>
</div>
</div>
<div><div class="header">
<div class="headertitle"><div class="title">Parallel Merge</div></div>
</div><!--header-->
<div class="contents">
<div class="toc"><h3>Table of Contents</h3>
<ul>
<li class="level1">
<a href="#ParallelMergeInclude">Include the Header</a>
</li>
<li class="level1">
<a href="#ParallelMergeMotivation">Motivation</a>
</li>
<li class="level1">
<a href="#ParallelMergeCreate">Create a Parallel-Merge Task</a>
</li>
<li class="level1">
<a href="#ParallelMergeCaptureByReference">Capture Iterators by Reference</a>
</li>
<li class="level1">
<a href="#ParallelMergeAlgorithm">Understanding the Parallel Merge Algorithm</a>
<ul>
<li class="level2">
<a href="#ParallelMergeAlgorithmOverview">Step 1: Flat Output Partitioning</a>
</li>
<li class="level2">
<a href="#ParallelMergeAlgorithmCorank">Step 2: Co-rank to Identify Input Subranges</a>
</li>
<li class="level2">
<a href="#ParallelMergeAlgorithmBinarySearch">Step 3: How co_rank works? Binary Search!</a>
</li>
<li class="level2">
<a href="#ParallelMergeAlgorithmNoPartitioner">Why parallel merge uses no partitioner</a>
</li>
</ul>
</li>
</ul>
</div>
<div class="textblock"><p>Taskflow provides a function for constructing a task to merge two sorted ranges into a single sorted output range in parallel.</p>
<h1><a class="anchor" id="ParallelMergeInclude"></a>
Include the Header</h1>
<p>You need to include the header file, <code>taskflow/algorithm/merge.hpp</code>, for creating a parallel-merge task.</p>
<div class="fragment"><div class="line"><span class="preprocessor">#include <taskflow/algorithm/merge.hpp></span></div>
</div><!-- fragment --><h1><a class="anchor" id="ParallelMergeMotivation"></a>
Motivation</h1>
<p>The standard library <code>std::merge</code> walks both input sequences simultaneously from left to right in a single thread, producing a sorted output in O(n+m) time. While optimal for a single core, this sequential walk cannot exploit multiple CPU cores — at any point only one comparison is in flight. For large inputs (e.g. merging two sorted arrays of tens of millions of elements), this leaves the majority of available hardware idle.</p>
<p>Taskflow's parallel merge divides the output into <code>W</code> independent chunks (one per worker thread) and uses the <em>co-rank</em> technique to identify each worker's exact input sub-ranges, allowing all workers to merge their chunks simultaneously with no synchronization.</p>
<div class="dotgraph">
<iframe scrolling="no" frameborder="0" src="dot_merge_overview.svg" width="736" height="452"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe></div>
<h1><a class="anchor" id="ParallelMergeCreate"></a>
Create a Parallel-Merge Task</h1>
<p>The task created by <a class="el" href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa" title="merges two sorted ranges into a single sorted output using the std::less comparator">tf::Taskflow::merge(B1 first1, E1 last1, B2 first2, E2 last2, O d_first)</a> merges the two sorted ranges <code>[first1, last1)</code> and <code>[first2, last2)</code> into the output range beginning at <code>d_first</code>, using <code>std::less</code> as the comparator. It represents the parallel execution of <code>std::merge:</code> </p>
<div class="fragment"><div class="line">std::merge(first1, last1, first2, last2, d_first);</div>
</div><!-- fragment --><p>The following example merges two sorted vectors of one million integers using four worker threads:</p>
<div class="fragment"><div class="line"><span class="preprocessor">#include <taskflow/algorithm/merge.hpp></span></div>
<div class="line"> </div>
<div class="line"><a class="code hl_class" href="classtf_1_1Taskflow.html">tf::Taskflow</a> taskflow;</div>
<div class="line"><a class="code hl_class" href="classtf_1_1Executor.html">tf::Executor</a> executor(4);</div>
<div class="line"> </div>
<div class="line">std::vector<int> seq1(1000000), seq2(1000000), output(2000000);</div>
<div class="line"> </div>
<div class="line"><span class="comment">// fill and sort seq1 and seq2 ...</span></div>
<div class="line"> </div>
<div class="line">taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa">merge</a>(</div>
<div class="line"> seq1.begin(), seq1.end(),</div>
<div class="line"> seq2.begin(), seq2.end(),</div>
<div class="line"> output.begin()</div>
<div class="line">);</div>
<div class="line"> </div>
<div class="line">executor.run(taskflow).wait();</div>
<div class="line"><span class="comment">// output is now the sorted merge of seq1 and seq2</span></div>
<div class="ttc" id="aclasstf_1_1Executor_html"><div class="ttname"><a href="classtf_1_1Executor.html">tf::Executor</a></div><div class="ttdoc">class to create an executor</div><div class="ttdef"><b>Definition</b> executor.hpp:62</div></div>
<div class="ttc" id="aclasstf_1_1FlowBuilder_html_adba094978bb2c038163e0b1b18141efa"><div class="ttname"><a href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa">tf::FlowBuilder::merge</a></div><div class="ttdeci">Task merge(B1 first1, E1 last1, B2 first2, E2 last2, O d_first)</div><div class="ttdoc">merges two sorted ranges into a single sorted output using the std::less comparator</div></div>
<div class="ttc" id="aclasstf_1_1Taskflow_html"><div class="ttname"><a href="classtf_1_1Taskflow.html">tf::Taskflow</a></div><div class="ttdoc">class to create a taskflow object</div><div class="ttdef"><b>Definition</b> taskflow.hpp:64</div></div>
</div><!-- fragment --><p>To merge with a custom comparator, pass it as the last argument. Both input ranges must be sorted with respect to that comparator:</p>
<div class="fragment"><div class="line"><span class="comment">// descending order</span></div>
<div class="line">std::sort(seq1.begin(), seq1.end(), std::greater<int>{});</div>
<div class="line">std::sort(seq2.begin(), seq2.end(), std::greater<int>{});</div>
<div class="line"> </div>
<div class="line">taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa">merge</a>(</div>
<div class="line"> seq1.begin(), seq1.end(),</div>
<div class="line"> seq2.begin(), seq2.end(),</div>
<div class="line"> output.begin(),</div>
<div class="line"> std::greater<int>{}</div>
<div class="line">);</div>
<div class="line"> </div>
<div class="line">executor.run(taskflow).wait();</div>
</div><!-- fragment --><dl class="section note"><dt>Note</dt><dd>Both input ranges must be sorted with respect to the comparator before the merge task runs. Passing unsorted input is undefined behavior. The output range must not overlap either input range. Both input iterators must be random-access iterators.</dd></dl>
<h1><a class="anchor" id="ParallelMergeCaptureByReference"></a>
Capture Iterators by Reference</h1>
<p>You can pass iterators by reference using <a href="https://en.cppreference.com/w/cpp/utility/functional/ref">std::ref</a> to marshal parameter updates between dependent tasks. This is useful when the ranges are not known at task-graph construction time but are initialized by an upstream task.</p>
<div class="fragment"><div class="line">std::vector<int> seq1, seq2, output;</div>
<div class="line">std::vector<int>::iterator beg1, end1, beg2, end2, d_beg;</div>
<div class="line"> </div>
<div class="line"><a class="code hl_class" href="classtf_1_1Task.html">tf::Task</a> init = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&]() {</div>
<div class="line"> <span class="comment">// fill and sort seq1, seq2 at runtime</span></div>
<div class="line"> beg1 = seq1.begin(); end1 = seq1.end();</div>
<div class="line"> beg2 = seq2.begin(); end2 = seq2.end();</div>
<div class="line"> d_beg = output.begin();</div>
<div class="line">});</div>
<div class="line"> </div>
<div class="line"><a class="code hl_class" href="classtf_1_1Task.html">tf::Task</a> merge_task = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa">merge</a>(</div>
<div class="line"> std::ref(beg1), std::ref(end1),</div>
<div class="line"> std::ref(beg2), std::ref(end2),</div>
<div class="line"> std::ref(d_beg)</div>
<div class="line">);</div>
<div class="line"> </div>
<div class="line"><span class="comment">// wrong! iterators are captured by copy at construction time</span></div>
<div class="line"><span class="comment">// tf::Task merge_task = taskflow.merge(beg1, end1, beg2, end2, d_beg);</span></div>
<div class="line"> </div>
<div class="line">init.<a class="code hl_function" href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">precede</a>(merge_task);</div>
<div class="line">executor.run(taskflow).wait();</div>
<div class="ttc" id="aclasstf_1_1FlowBuilder_html_a4d52a7fe2814b264846a2085e931652c"><div class="ttname"><a href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">tf::FlowBuilder::emplace</a></div><div class="ttdeci">Task emplace(C &&callable)</div><div class="ttdoc">creates a static task</div><div class="ttdef"><b>Definition</b> flow_builder.hpp:1571</div></div>
<div class="ttc" id="aclasstf_1_1Task_html"><div class="ttname"><a href="classtf_1_1Task.html">tf::Task</a></div><div class="ttdoc">class to create a task handle over a taskflow node</div><div class="ttdef"><b>Definition</b> task.hpp:569</div></div>
<div class="ttc" id="aclasstf_1_1Task_html_a8c78c453295a553c1c016e4062da8588"><div class="ttname"><a href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">tf::Task::precede</a></div><div class="ttdeci">Task & precede(Ts &&... tasks)</div><div class="ttdoc">adds precedence links from this to other tasks</div><div class="ttdef"><b>Definition</b> task.hpp:1258</div></div>
</div><!-- fragment --><p>When <code>init</code> finishes, the merge task reads the updated iterators and merges the two runtime-defined sequences.</p>
<h1><a class="anchor" id="ParallelMergeAlgorithm"></a>
Understanding the Parallel Merge Algorithm</h1>
<h2><a class="anchor" id="ParallelMergeAlgorithmOverview"></a>
Step 1: Flat Output Partitioning</h2>
<p>The merged output has <code>N</code> = n + m elements (where<code>n = |seq1|</code> and <code>m = |seq2|</code>). <br />
The algorithm divides the output into <code>W</code> equal contiguous chunks of size <code>K = N/W</code>, one per worker thread. Worker <code>w</code> is responsible for writing output positions <code>[w*K, (w+1)*K)</code>. The key challenge here is that for each output chunk, which elements of <code>seq1</code> and <code>seq2</code> belong to it? This is what the co-rank function solves.</p>
<h2><a class="anchor" id="ParallelMergeAlgorithmCorank"></a>
Step 2: Co-rank to Identify Input Subranges</h2>
<p>For a given output position <code>rank</code>, <code>co_rank(rank)</code> finds <code>i</code> and <code>j</code> such that:</p>
<ul>
<li>The first <code>rank</code> elements of the merged output consist of exactly <code>i</code> elements from <code>seq1</code>[0..i) and <code>j</code> elements from <code>seq2</code>[0..j).</li>
<li>These <code>i</code> + <code>j</code> elements are <em>interleaved</em> in sorted order in the output. Note that <code>co_rank</code> does not decide how they go in blocks but only identifies <em>how</em> many elements come from each sequence. The actual placement of these elements are accomplished via std::merge.</li>
</ul>
<p>Once a worker knows <code>(i_start, j_start)</code> at its chunk's beginning and <code>(i_end, j_end)</code> at its chunk's end, it can independently merge <code>seq1[i_start..i_end)</code> with <code>seq2[j_start..j_end)</code> and write the result directly to its output region, with no communication with other workers.</p>
<div class="image">
<object type="image/svg+xml" data="merge_partition.svg" width="90%" style="pointer-events: none;"></object>
</div>
<h2><a class="anchor" id="ParallelMergeAlgorithmBinarySearch"></a>
Step 3: How co_rank works? Binary Search!</h2>
<p>For a given <code>rank</code>, co_rank binary-searches for the unique <code>i</code> in the range <code>[max(0, rank-m), min(n, rank)]</code> such that the partition <code>(seq1[0..i), seq2[0..rank-i))</code> is valid. A partition is <em>valid</em> when:</p>
<ul>
<li>The last element of seq1's slice does not exceed the first unused element of <code>seq2</code>: <code>seq1[i-1] <= seq2[j]</code></li>
<li>The last element of seq2's slice does not exceed the first unused element of <code>seq1</code>: <code>seq2[j-1] <= seq1[i]</code></li>
</ul>
<p>Because both sequences are sorted, as <code>i</code> increases:</p><ul>
<li><code>seq1[i-1]</code> increases (moving right in <code>seq1</code>)</li>
<li><code>seq2[j-1]</code> decreases (moving left in <code>seq2</code>, since <code>j = rank - i</code> decreases)</li>
</ul>
<p>This means the condition <code>seq2</code>[j-1] <= <code>seq1</code>[i] transitions from <code>false</code> to <code>true</code> exactly once — making binary search applicable. The algorithm needs to check only one condition per iteration:</p>
<ul>
<li>If <code>seq2</code>[j-1] > <code>seq1</code>[i]: <code>i</code> is too large -> <code>high</code> = i</li>
<li>Otherwise: <code>i</code> is not large enough -> <code>low</code> = i + 1</li>
</ul>
<p>The figure below shows two iterations of the binary search for <code>rank=5</code> on sequences <code>seq1=[1,3,5,7,9,11]</code> and <code>seq2=[2,4,6,8,10,12]</code>:</p>
<div class="image">
<object type="image/svg+xml" data="merge_corank_search.svg" width="90%" style="pointer-events: none;"></object>
</div>
<p>The search converges to <code>i=3</code>, <code>j=2</code>: the first 5 merged elements consist of <code>seq1[0,3)=[1,3,5]</code> and <code>seq2[0,2)=[2,4]</code>, which interleave to <code>[1,2,3,4,5]</code>.</p>
<h2><a class="anchor" id="ParallelMergeAlgorithmNoPartitioner"></a>
Why parallel merge uses no partitioner</h2>
<p>Unlike <a class="el" href="classtf_1_1FlowBuilder.html#aae3edfa278baa75b08414e083c14c836" title="constructs an STL-styled parallel-for task">tf::Taskflow::for_each</a>, where users can configure a partitioner to adapt to unequal per-element costs, <code>std::merge</code> on a chunk of size <code>K</code> always costs <code>O(K)</code> regardless of the data values. There is little load imbalance to mitigate. As a result, <a class="el" href="classtf_1_1FlowBuilder.html#adba094978bb2c038163e0b1b18141efa" title="merges two sorted ranges into a single sorted output using the std::less comparator">tf::Taskflow::merge</a> always adopts static partitioning, i.e., <code>W</code> chunks of size <code>N/W</code> and one per worker. which is always the optimal strategy for parallel merge. </p>
</div></div><!-- contents -->
</div><!-- PageDoc -->
</div><!-- doc-content -->
<!-- HTML footer for doxygen 1.13.1-->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="navelem"><a class="el" href="Algorithms.html">Taskflow Algorithms</a></li>
<li class="footer">
Maintained by <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>
—
Generated by <a href="https://www.doxygen.org/index.html"><img class="footer" src="doxygen.svg" width="104" height="31" alt="doxygen"/></a> 1.13.1
</li>
</ul>
</div>