<!-- HTML header for doxygen 1.13.1-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=11"/>
<meta name="generator" content="Doxygen 1.13.1"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<title>Taskflow: A General-purpose Task-parallel Programming System: Text Processing Pipeline</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<script type="text/javascript" src="clipboard.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="cookie.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="custom.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr id="projectrow">
<td id="projectlogo"><img alt="Logo" src="taskflow_logo.png"/></td>
<td id="projectalign">
<div id="projectname"><a href="https://github.com/taskflow/taskflow" style="color:inherit; text-decoration:none;">Taskflow: A General-purpose Task-parallel Programming System</a>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.13.1 -->
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
var searchBox = new SearchBox("searchBox", "search/",'.html');
/* @license-end */
</script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() { codefold.init(0); });
/* @license-end */
</script>
<script type="text/javascript" src="menudata.js"></script>
<script type="text/javascript" src="menu.js"></script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() {
initMenu('',true,false,'search.php','Search',true);
$(function() { init_search(); });
});
/* @license-end */
</script>
<div id="main-nav"></div>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function(){initNavTree('TextProcessingPipeline.html',''); initResizable(true); });
/* @license-end */
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<div id="MSearchResults">
<div class="SRPage">
<div id="SRIndex">
<div id="SRResults"></div>
<div class="SRStatus" id="Loading">Loading...</div>
<div class="SRStatus" id="Searching">Searching...</div>
<div class="SRStatus" id="NoMatches">No Matches</div>
</div>
</div>
</div>
</div>
<div><div class="header">
<div class="headertitle"><div class="title">Text Processing Pipeline</div></div>
</div><!--header-->
<div class="contents">
<div class="toc"><h3>Table of Contents</h3>
<ul>
<li class="level1">
<a href="#FormulateTheTextProcessingPipelineProblem">Problem Formulation</a>
</li>
<li class="level1">
<a href="#CreateAParallelTextPipeline">Creating the Pipeline</a>
<ul>
<li class="level2">
<a href="#TextPipelineBuffer">Data Buffer</a>
</li>
<li class="level2">
<a href="#TextPipelineOutput">Sample Output</a>
</li>
</ul>
</li>
</ul>
</div>
<div class="textblock"><p>We study a text processing pipeline that finds the most frequent character in each string from an input source, demonstrating how Taskflow's pipeline model overlaps serial and parallel stages to process a stream of tokens efficiently.</p>
<h1><a class="anchor" id="FormulateTheTextProcessingPipelineProblem"></a>
Problem Formulation</h1>
<p>Given a vector of strings, we want to find the most frequent character in each string and output the result in the same order as the input. For example:</p>
<div class="fragment"><div class="line"># input</div>
<div class="line">abade ddddf eefge xyzzd ijjjj jiiii kkijk</div>
<div class="line"> </div>
<div class="line"># output (most frequent character : count)</div>
<div class="line">a:2 d:4 e:3 z:2 j:4 i:4 k:3</div>
</div><!-- fragment --><p>We decompose the computation into three stages:</p>
<ol type="1">
<li><b>Read</b> (serial) — read one string from the input vector in order</li>
<li><b>Count</b> (parallel) — build a character-frequency map from the string</li>
<li><b>Reduce</b> (serial) — find the most frequent character from the map and output the result</li>
</ol>
<p>The first and third stages must run serially to preserve input/output order. The second stage is independent across strings and can run in parallel across multiple pipeline lines.</p>
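<p>Before adding parallelism, it can help to see the three stages as plain sequential functions. The following is a minimal sketch using only the standard library; the names <code>count_chars</code>, <code>most_frequent</code>, and <code>solve</code> are hypothetical helpers introduced for illustration and are not part of Taskflow. It assumes non-empty strings and breaks ties in favor of the smallest character value, a choice the problem statement does not prescribe.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Count stage: tally how often each byte value occurs in the string.
std::array<std::size_t, 256> count_chars(const std::string& s) {
  std::array<std::size_t, 256> counts{};  // zero-initialized
  for (unsigned char c : s) ++counts[c];
  return counts;
}

// Reduce stage: pick the character with the highest count.
// Ties go to the smallest character value (an assumption of this sketch).
std::pair<char, std::size_t> most_frequent(
    const std::array<std::size_t, 256>& counts) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < counts.size(); ++i) {
    if (counts[i] > counts[best]) best = i;
  }
  return {static_cast<char>(best), counts[best]};
}

// Read stage plus driver: walk the input in order, one result per string.
std::vector<std::pair<char, std::size_t>> solve(
    const std::vector<std::string>& input) {
  std::vector<std::pair<char, std::size_t>> out;
  out.reserve(input.size());
  for (const auto& s : input) {
    assert(!s.empty());  // the sketch assumes non-empty strings
    out.push_back(most_frequent(count_chars(s)));
  }
  return out;
}
```

<p>Calling <code>solve</code> on the seven example strings yields the pairs listed in the expected output above, in input order. The pipeline version produces the same results while letting the count stage of different strings overlap.</p>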
<h1><a class="anchor" id="CreateAParallelTextPipeline"></a>
Creating the Pipeline</h1>
<p>We create a pipeline of three pipes with two parallel lines. The line count bounds how many tokens can be in flight at once, so more lines can raise throughput at the cost of one extra buffer slot per line. In practice, the value returned by <code>std::thread::hardware_concurrency()</code> is a good default.</p>
<div class="fragment"><div class="line"><span class="preprocessor">#include &lt;taskflow/taskflow.hpp&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;taskflow/algorithm/pipeline.hpp&gt;</span></div>
<div class="line">&#160;</div>
<div class="line"><span class="preprocessor">#include &lt;algorithm&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;array&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;cstdio&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;iostream&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;sstream&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;string&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;unordered_map&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;utility&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;variant&gt;</span></div>
<div class="line"><span class="preprocessor">#include &lt;vector&gt;</span></div>
<div class="line">&#160;</div>
<div class="line"><span class="comment">// formats a frequency map as "c:n " entries for logging</span></div>
<div class="line">std::string format_map(<span class="keyword">const</span> std::unordered_map&lt;char, size_t&gt;&amp; map) {</div>
<div class="line">  std::ostringstream oss;</div>
<div class="line">  <span class="keywordflow">for</span>(<span class="keyword">const</span> <span class="keyword">auto</span>&amp; [c, n] : map) oss &lt;&lt; c &lt;&lt; <span class="charliteral">':'</span> &lt;&lt; n &lt;&lt; <span class="charliteral">' '</span>;</div>
<div class="line">  <span class="keywordflow">return</span> oss.str();</div>
<div class="line">}</div>
<div class="line">&#160;</div>
<div class="line"><span class="keywordtype">int</span> main() {</div>
<div class="line">&#160;</div>
<div class="line">  <a class="code hl_class" href="classtf_1_1Taskflow.html">tf::Taskflow</a> taskflow(<span class="stringliteral">"text-pipeline"</span>);</div>
<div class="line">  <a class="code hl_class" href="classtf_1_1Executor.html">tf::Executor</a> executor;</div>
<div class="line">&#160;</div>
<div class="line">  <span class="keyword">const</span> <span class="keywordtype">size_t</span> num_lines = 2;</div>
<div class="line">&#160;</div>
<div class="line">  std::vector&lt;std::string&gt; input = {</div>
<div class="line">    <span class="stringliteral">"abade"</span>, <span class="stringliteral">"ddddf"</span>, <span class="stringliteral">"eefge"</span>, <span class="stringliteral">"xyzzd"</span>, <span class="stringliteral">"ijjjj"</span>, <span class="stringliteral">"jiiii"</span>, <span class="stringliteral">"kkijk"</span></div>
<div class="line">  };</div>
<div class="line">&#160;</div>
<div class="line">  <span class="comment">// one buffer slot per pipeline line; each slot holds the data for one token</span></div>
<div class="line">  <span class="keyword">using </span>data_type = std::variant&lt;</div>
<div class="line">    std::string,</div>
<div class="line">    std::unordered_map&lt;char, size_t&gt;,</div>
<div class="line">    std::pair&lt;char, size_t&gt;</div>
<div class="line">  &gt;;</div>
<div class="line">  std::array&lt;data_type, num_lines&gt; buffer;</div>
<div class="line">&#160;</div>
<div class="line">  <a class="code hl_class" href="classtf_1_1Pipeline.html">tf::Pipeline</a> pl(num_lines,</div>
<div class="line">&#160;</div>
<div class="line">    <span class="comment">// stage 1 (serial): read the next input string</span></div>
<div class="line">    <a class="code hl_class" href="classtf_1_1Pipe.html">tf::Pipe</a>{<a class="code hl_enumvalue" href="namespacetf.html#abb7a11e41fd457f69e7ff45d4c769564a7b804a28d6154ab8007287532037f1d0">tf::PipeType::SERIAL</a>, [&amp;](<a class="code hl_class" href="classtf_1_1Pipeflow.html">tf::Pipeflow</a>&amp; pf) {</div>
<div class="line">      <span class="keywordflow">if</span>(pf.token() == input.size()) {</div>
<div class="line">        pf.stop(); <span class="comment">// no more tokens: shut the pipeline down</span></div>
<div class="line">      }</div>
<div class="line">      <span class="keywordflow">else</span> {</div>
<div class="line">        printf(<span class="stringliteral">"stage 1: %s\n"</span>, input[pf.token()].c_str());</div>
<div class="line">        buffer[pf.line()] = input[pf.token()];</div>
<div class="line">      }</div>
<div class="line">    }},</div>
<div class="line">&#160;</div>
<div class="line">    <span class="comment">// stage 2 (parallel): build a character-frequency map</span></div>
<div class="line">    tf::Pipe{<a class="code hl_enumvalue" href="namespacetf.html#abb7a11e41fd457f69e7ff45d4c769564adf13a99b035d6f0bce4f44ab18eec8eb">tf::PipeType::PARALLEL</a>, [&amp;](tf::Pipeflow&amp; pf) {</div>
<div class="line">      std::unordered_map&lt;char, size_t&gt; map;</div>
<div class="line">      <span class="keywordflow">for</span>(<span class="keywordtype">char</span> c : std::get&lt;std::string&gt;(buffer[pf.line()])) map[c]++;</div>
<div class="line">      printf(<span class="stringliteral">"stage 2: %s\n"</span>, format_map(map).c_str());</div>
<div class="line">      buffer[pf.line()] = std::move(map);</div>
<div class="line">    }},</div>
<div class="line">&#160;</div>
<div class="line">    <span class="comment">// stage 3 (serial): find and report the most frequent character</span></div>
<div class="line">    tf::Pipe{<a class="code hl_enumvalue" href="namespacetf.html#abb7a11e41fd457f69e7ff45d4c769564a7b804a28d6154ab8007287532037f1d0">tf::PipeType::SERIAL</a>, [&amp;](tf::Pipeflow&amp; pf) {</div>
<div class="line">      <span class="keyword">auto</span>&amp; map = std::get&lt;std::unordered_map&lt;char, size_t&gt;&gt;(buffer[pf.line()]);</div>
<div class="line">      <span class="keyword">auto</span> sol = std::max_element(map.begin(), map.end(),</div>
<div class="line">        [](<span class="keyword">const</span> <span class="keyword">auto</span>&amp; a, <span class="keyword">const</span> <span class="keyword">auto</span>&amp; b) { <span class="keywordflow">return</span> a.second &lt; b.second; }</div>
<div class="line">      );</div>
<div class="line">      printf(<span class="stringliteral">"stage 3: %c:%zu\n"</span>, sol-&gt;first, sol-&gt;second);</div>
<div class="line">      <span class="comment">// copy the entry out first: the assignment destroys the map that sol points into</span></div>
<div class="line">      buffer[pf.line()] = std::pair&lt;char, size_t&gt;{sol-&gt;first, sol-&gt;second};</div>
<div class="line">    }}</div>
<div class="line">  );</div>
<div class="line">&#160;</div>
<div class="line">  tf::Task init = taskflow.emplace([](){ std::cout &lt;&lt; <span class="stringliteral">"ready\n"</span>; })</div>
<div class="line">                          .name(<span class="stringliteral">"start"</span>);</div>
<div class="line">  tf::Task pipe = taskflow.composed_of(pl)</div>
<div class="line">                          .name(<span class="stringliteral">"pipeline"</span>);</div>
<div class="line">  tf::Task done = taskflow.emplace([](){ std::cout &lt;&lt; <span class="stringliteral">"done\n"</span>; })</div>
<div class="line">                          .name(<span class="stringliteral">"stop"</span>);</div>
<div class="line">&#160;</div>
<div class="line">  init.precede(pipe);</div>
<div class="line">  pipe.<a class="code hl_function" href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">precede</a>(done);</div>
<div class="line">&#160;</div>
<div class="line">  executor.<a class="code hl_function" href="classtf_1_1Executor.html#a519777f5783981d534e9e53b99712069">run</a>(taskflow).wait();</div>
<div class="line">&#160;</div>
<div class="line">  <span class="keywordflow">return</span> 0;</div>
<div class="line">}</div>
<div class="ttc" id="aclasstf_1_1Executor_html"><div class="ttname"><a href="classtf_1_1Executor.html">tf::Executor</a></div><div class="ttdoc">class to create an executor</div><div class="ttdef"><b>Definition</b> executor.hpp:62</div></div>
<div class="ttc" id="aclasstf_1_1Executor_html_a519777f5783981d534e9e53b99712069"><div class="ttname"><a href="classtf_1_1Executor.html#a519777f5783981d534e9e53b99712069">tf::Executor::run</a></div><div class="ttdeci">tf::Future&lt; void &gt; run(Taskflow &amp;taskflow)</div><div class="ttdoc">runs a taskflow once</div></div>
<div class="ttc" id="aclasstf_1_1Pipe_html"><div class="ttname"><a href="classtf_1_1Pipe.html">tf::Pipe</a></div><div class="ttdoc">class to create a pipe object for a pipeline stage</div><div class="ttdef"><b>Definition</b> pipeline.hpp:144</div></div>
<div class="ttc" id="aclasstf_1_1Pipeflow_html"><div class="ttname"><a href="classtf_1_1Pipeflow.html">tf::Pipeflow</a></div><div class="ttdoc">class to create a pipeflow object used by the pipe callable</div><div class="ttdef"><b>Definition</b> pipeline.hpp:43</div></div>
<div class="ttc" id="aclasstf_1_1Pipeline_html"><div class="ttname"><a href="classtf_1_1Pipeline.html">tf::Pipeline</a></div><div class="ttdoc">class to create a pipeline scheduling framework</div><div class="ttdef"><b>Definition</b> pipeline.hpp:307</div></div>
<div class="ttc" id="aclasstf_1_1Task_html_a8c78c453295a553c1c016e4062da8588"><div class="ttname"><a href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">tf::Task::precede</a></div><div class="ttdeci">Task &amp; precede(Ts &amp;&amp;... tasks)</div><div class="ttdoc">adds precedence links from this to other tasks</div><div class="ttdef"><b>Definition</b> task.hpp:1258</div></div>
<div class="ttc" id="aclasstf_1_1Taskflow_html"><div class="ttname"><a href="classtf_1_1Taskflow.html">tf::Taskflow</a></div><div class="ttdoc">class to create a taskflow object</div><div class="ttdef"><b>Definition</b> taskflow.hpp:64</div></div>
<div class="ttc" id="anamespacetf_html_abb7a11e41fd457f69e7ff45d4c769564a7b804a28d6154ab8007287532037f1d0"><div class="ttname"><a href="namespacetf.html#abb7a11e41fd457f69e7ff45d4c769564a7b804a28d6154ab8007287532037f1d0">tf::PipeType::SERIAL</a></div><div class="ttdeci">@ SERIAL</div><div class="ttdoc">serial type</div><div class="ttdef"><b>Definition</b> pipeline.hpp:117</div></div>
<div class="ttc" id="anamespacetf_html_abb7a11e41fd457f69e7ff45d4c769564adf13a99b035d6f0bce4f44ab18eec8eb"><div class="ttname"><a href="namespacetf.html#abb7a11e41fd457f69e7ff45d4c769564adf13a99b035d6f0bce4f44ab18eec8eb">tf::PipeType::PARALLEL</a></div><div class="ttdeci">@ PARALLEL</div><div class="ttdoc">parallel type</div><div class="ttdef"><b>Definition</b> pipeline.hpp:115</div></div>
</div><!-- fragment --><h2><a class="anchor" id="TextPipelineBuffer"></a>
Data Buffer</h2>
<p>Taskflow gives users full control over data management in a pipeline. We allocate a one-dimensional buffer indexed by pipeline line:</p>
<div class="fragment"><div class="line">std::array&lt;data_type, num_lines&gt; buffer;</div>
</div><!-- fragment --><p>A one-dimensional buffer is sufficient because Taskflow guarantees that at most one scheduling token is active per line at any time, so no two tokens will read or write the same buffer slot simultaneously.</p>
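<p>To make the slot reuse concrete, the sketch below walks a single buffer slot through the three stages without any Taskflow machinery. The free functions <code>read_stage</code>, <code>count_stage</code>, and <code>reduce_stage</code> and the alias <code>freq_map</code> are hypothetical names introduced here; they mirror the three pipe callables but are not part of the example above. The point is that the same <code>std::variant</code> slot holds a string, then a frequency map, then a result pair.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>

using freq_map  = std::unordered_map<char, std::size_t>;
using data_type = std::variant<std::string, freq_map,
                               std::pair<char, std::size_t>>;

// Stage 1: store the input string in the slot (alternative index 0).
void read_stage(data_type& slot, const std::string& s) { slot = s; }

// Stage 2: replace the string with its character-frequency map (index 1).
void count_stage(data_type& slot) {
  freq_map map;
  for (char c : std::get<std::string>(slot)) ++map[c];
  slot = std::move(map);
}

// Stage 3: replace the map with its most frequent entry (index 2).
void reduce_stage(data_type& slot) {
  const auto& map = std::get<freq_map>(slot);
  assert(!map.empty());  // the sketch assumes a non-empty input string
  auto best = std::max_element(map.begin(), map.end(),
      [](const auto& a, const auto& b) { return a.second < b.second; });
  // Copy the entry into a local before assigning: the assignment
  // destroys the map that `best` points into.
  std::pair<char, std::size_t> result{best->first, best->second};
  slot = result;
}
```

<p>Each stage reads the alternative written by the previous stage with <code>std::get</code>, which throws <code>std::bad_variant_access</code> if the slot holds a different alternative, so a stage run out of order fails loudly rather than silently.</p>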
<dl class="section note"><dt>Note</dt><dd>Taskflow's pipeline schedules tokens only; it does not move data between stages. The token index (<a class="el" href="classtf_1_1Pipeflow.html#a295e5d884665c076f4ef5d78139f7c51" title="queries the token identifier">tf::Pipeflow::token</a>) identifies which input element is being processed, and the line index (<a class="el" href="classtf_1_1Pipeflow.html#afee054e6a99965d4b3e36ff903227e6c" title="queries the line identifier of the present token">tf::Pipeflow::line</a>) identifies which buffer slot holds its data.</dd></dl>
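<p>The relationship between token index and line index can be sketched with plain arithmetic. The code below assumes that tokens are handed to lines in round-robin order, so that token <code>t</code> runs on line <code>t % num_lines</code>; this is an assumption of the sketch, not something the example above states. The helpers <code>line_of</code> and <code>slots_by_line</code> are hypothetical names introduced for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: the buffer slot a token would use, assuming
// round-robin assignment of tokens to lines.
std::size_t line_of(std::size_t token, std::size_t num_lines) {
  return token % num_lines;
}

// Group the input strings by the slot they would pass through,
// keeping token order within each slot.
std::vector<std::vector<std::string>> slots_by_line(
    const std::vector<std::string>& input, std::size_t num_lines) {
  assert(num_lines > 0);  // at least one line is required
  std::vector<std::vector<std::string>> slots(num_lines);
  for (std::size_t token = 0; token < input.size(); ++token) {
    slots[line_of(token, num_lines)].push_back(input[token]);
  }
  return slots;
}
```

<p>Under this assumption, with two lines the strings at even token indices pass through slot 0 and those at odd indices pass through slot 1. A slot is reused only after its previous token has left the last stage, which is why one slot per line is enough.</p>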
<h2><a class="anchor" id="TextPipelineOutput"></a>
Sample Output</h2>
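<p>Because stage 2 is a parallel pipe, its output may interleave across lines, and the order of entries within each stage-2 line follows the unspecified iteration order of <code>std::unordered_map</code>. One possible execution trace:</p>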
<div class="fragment"><div class="line">ready</div>
<div class="line">stage 1: abade</div>
<div class="line">stage 1: ddddf</div>
<div class="line">stage 2: f:1 d:4</div>
<div class="line">stage 2: e:1 d:1 a:2 b:1</div>
<div class="line">stage 3: a:2</div>
<div class="line">stage 1: eefge</div>
<div class="line">stage 2: g:1 e:3 f:1</div>
<div class="line">stage 3: d:4</div>
<div class="line">stage 1: xyzzd</div>
<div class="line">stage 3: e:3</div>
<div class="line">stage 1: ijjjj</div>
<div class="line">stage 2: z:2 x:1 d:1 y:1</div>
<div class="line">stage 3: z:2</div>
<div class="line">stage 1: jiiii</div>
<div class="line">stage 2: j:4 i:1</div>
<div class="line">stage 3: j:4</div>
<div class="line">stage 2: i:4 j:1</div>
<div class="line">stage 1: kkijk</div>
<div class="line">stage 3: i:4</div>
<div class="line">stage 2: j:1 k:3 i:1</div>
<div class="line">stage 3: k:3</div>
<div class="line">done</div>
</div><!-- fragment --><p>The seven stage-3 outputs appear in the same order as the input (<code>a:2</code>, <code>d:4</code>, <code>e:3</code>, <code>z:2</code>, <code>j:4</code>, <code>i:4</code>, <code>k:3</code>), as guaranteed by the serial pipe declaration. The pipeline task graph is shown below:</p>
<div class="dotgraph">
<iframe scrolling="no" frameborder="0" src="dot_text_processing_pipeline.svg" width="446" height="350"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe></div>
</div></div><!-- contents -->
</div><!-- PageDoc -->
</div><!-- doc-content -->
<!-- HTML footer for doxygen 1.13.1-->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="navelem"><a class="el" href="Examples.html">Learning from Examples</a></li>
<li class="footer">
Maintained by <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>
—
Generated by <a href="https://www.doxygen.org/index.html"><img class="footer" src="doxygen.svg" width="104" height="31" alt="doxygen"/></a> 1.13.1
</li>
</ul>
</div>