<!-- HTML header for doxygen 1.13.1-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=11"/>
<meta name="generator" content="Doxygen 1.13.1"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<title>Taskflow: A General-purpose Task-parallel Programming System: Matrix Multiplication with CUDA GPU</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<script type="text/javascript" src="clipboard.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="cookie.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="custom.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr id="projectrow">
<td id="projectlogo"><img alt="Logo" src="taskflow_logo.png"/></td>
<td id="projectalign">
<div id="projectname"><a href="https://github.com/taskflow/taskflow" style="color:inherit; text-decoration:none;">Taskflow: A General-purpose Task-parallel Programming System</a>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.13.1 -->
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
var searchBox = new SearchBox("searchBox", "search/",'.html');
/* @license-end */
</script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() { codefold.init(0); });
/* @license-end */
</script>
<script type="text/javascript" src="menudata.js"></script>
<script type="text/javascript" src="menu.js"></script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function() {
initMenu('',true,false,'search.php','Search',true);
$(function() { init_search(); });
});
/* @license-end */
</script>
<div id="main-nav"></div>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt MIT */
$(function(){initNavTree('MatrixMultiplicationWithCUDAGPU.html',''); initResizable(true); });
/* @license-end */
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<div id="MSearchResults">
<div class="SRPage">
<div id="SRIndex">
<div id="SRResults"></div>
<div class="SRStatus" id="Loading">Loading...</div>
<div class="SRStatus" id="Searching">Searching...</div>
<div class="SRStatus" id="NoMatches">No Matches</div>
</div>
</div>
</div>
</div>
<div><div class="header">
<div class="headertitle"><div class="title">Matrix Multiplication with CUDA GPU</div></div>
</div><!--header-->
<div class="contents">
<div class="toc"><h3>Table of Contents</h3>
<ul>
<li class="level1">
<a href="#GPUAcceleratedMatrixMultiplication">CUDA Kernel</a>
</li>
<li class="level1">
<a href="#DefineACUDAGraphForMatrixMultiplication">CUDA Graph Task</a>
</li>
<li class="level1">
<a href="#MatrixMultiplicationcudaFlowBenchmarking">Benchmarking</a>
</li>
</ul>
</div>
<div class="textblock"><p>Following <a class="el" href="matrix_multiplication.html">Matrix Multiplication</a>, we accelerate matrix multiplication on a CUDA GPU using <a class="el" href="namespacetf.html#a713c427e4f9841a90dec67045a3babed" title="default smart pointer type to manage a cudaGraph_t object with unique ownership">tf::cudaGraph</a>. The GPU's massive thread-level parallelism reduces the runtime of large problems from minutes on a single CPU core to a few hundred milliseconds.</p>
<h1><a class="anchor" id="GPUAcceleratedMatrixMultiplication"></a>
CUDA Kernel</h1>
<p>Unlike the CPU version, where each task processes one row, the GPU version assigns one CUDA thread to each element of <code>C</code>. We store all matrices in 1D row-major layout to simplify host-to-device transfers: element <code>(x, y)</code> (row <code>x</code>, column <code>y</code>) in a matrix of width <code>W</code> is stored at index <code>x * W + y</code>.</p>
<div class="image">
<img src="matrix_multiplication_4.png" alt="" width="70%"/>
</div>
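<p>As a minimal host-side sketch of this layout, the C++ snippet below maps a 2D coordinate to its flat index. The helper names <code>flat_index</code> and <code>make_example</code> are hypothetical, introduced only for illustration, and are not part of Taskflow or CUDA:</p>
<div class="fragment"><div class="line">#include &lt;cstddef&gt;</div>
<div class="line">#include &lt;vector&gt;</div>
<div class="line"> </div>
<div class="line">// Hypothetical helper: flat index of element (x, y), i.e. row x and column y,</div>
<div class="line">// in a row-major matrix with W columns.</div>
<div class="line">inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t W) {</div>
<div class="line">  return x * W + y;</div>
<div class="line">}</div>
<div class="line"> </div>
<div class="line">// Hypothetical example data: the 2x3 matrix {{1, 2, 3}, {4, 5, 6}} stored row by row.</div>
<div class="line">inline std::vector&lt;int&gt; make_example() {</div>
<div class="line">  return {1, 2, 3, 4, 5, 6};</div>
<div class="line">}</div>
</div><!-- fragment --><p>Under these assumptions, <code>make_example()[flat_index(1, 2, 3)]</code> reads row 1, column 2 of a matrix of width 3, which is flat index 5 and holds the value 6.</p>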
<p>The CUDA kernel is:</p>
<div class="fragment"><div class="line">__global__ <span class="keywordtype">void</span> matmul(<span class="keywordtype">int</span>* A, <span class="keywordtype">int</span>* B, <span class="keywordtype">int</span>* C, <span class="keywordtype">int</span> M, <span class="keywordtype">int</span> K, <span class="keywordtype">int</span> N) {</div>
<div class="line"> <span class="keywordtype">int</span> row = blockIdx.y * blockDim.y + threadIdx.y;</div>
<div class="line"> <span class="keywordtype">int</span> col = blockIdx.x * blockDim.x + threadIdx.x;</div>
<div class="line"> <span class="keywordtype">int</span> sum = 0;</div>
<div class="line"> <span class="keywordflow">if</span>(row &lt; M &amp;&amp; col &lt; N) {</div>
<div class="line"> <span class="keywordflow">for</span>(<span class="keywordtype">int</span> i = 0; i &lt; K; i++) {</div>
<div class="line"> sum += A[row * K + i] * B[i * N + col];</div>
<div class="line"> }</div>
<div class="line"> C[row * N + col] = sum;</div>
<div class="line"> }</div>
<div class="line">}</div>
</div><!-- fragment --><p>Each thread computes one element of <code>C</code> by iterating over the full inner dimension <code>K</code>.</p>
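<p>One way to check the kernel's output is to compare it against a sequential host computation that uses the same indexing. The function below is a sketch of such a reference, not code from Taskflow; the name <code>matmul_reference</code> and the use of <code>std::vector</code> are assumptions made for illustration:</p>
<div class="fragment"><div class="line">#include &lt;cstddef&gt;</div>
<div class="line">#include &lt;vector&gt;</div>
<div class="line"> </div>
<div class="line">// Hypothetical CPU reference: computes C = A * B for an M x K matrix A and a</div>
<div class="line">// K x N matrix B, both in row-major layout, with the same index arithmetic</div>
<div class="line">// as the kernel above. Each (row, col) pair corresponds to one GPU thread.</div>
<div class="line">inline std::vector&lt;int&gt; matmul_reference(const std::vector&lt;int&gt;&amp; A,</div>
<div class="line">                                         const std::vector&lt;int&gt;&amp; B,</div>
<div class="line">                                         int M, int K, int N) {</div>
<div class="line">  std::vector&lt;int&gt; C(static_cast&lt;std::size_t&gt;(M) * N, 0);</div>
<div class="line">  for (int row = 0; row &lt; M; ++row) {</div>
<div class="line">    for (int col = 0; col &lt; N; ++col) {</div>
<div class="line">      int sum = 0;</div>
<div class="line">      for (int i = 0; i &lt; K; ++i) {</div>
<div class="line">        sum += A[row * K + i] * B[i * N + col];</div>
<div class="line">      }</div>
<div class="line">      C[row * N + col] = sum;</div>
<div class="line">    }</div>
<div class="line">  }</div>
<div class="line">  return C;</div>
<div class="line">}</div>
</div><!-- fragment --><p>For example, multiplying the 2×3 matrix <code>{1, 2, 3, 4, 5, 6}</code> by the 3×2 matrix <code>{7, 8, 9, 10, 11, 12}</code> yields the 2×2 result <code>{58, 64, 139, 154}</code>, which a GPU result copied back to the host could be compared against element by element.</p>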
<h1><a class="anchor" id="DefineACUDAGraphForMatrixMultiplication"></a>
CUDA Graph Task</h1>
<p>We build a Taskflow that allocates GPU memory in parallel, runs the CUDA graph, and frees GPU memory when done:</p>
<div class="fragment"><div class="line"><span class="keywordtype">void</span> matrix_multiplication(<span class="keywordtype">int</span>* A, <span class="keywordtype">int</span>* B, <span class="keywordtype">int</span>* C, <span class="keywordtype">int</span> M, <span class="keywordtype">int</span> K, <span class="keywordtype">int</span> N) {</div>
<div class="line"> </div>
<div class="line"> <a class="code hl_class" href="classtf_1_1Taskflow.html">tf::Taskflow</a> taskflow;</div>
<div class="line"> <a class="code hl_class" href="classtf_1_1Executor.html">tf::Executor</a> executor;</div>
<div class="line"> </div>
<div class="line"> <span class="keywordtype">int</span> *da, *db, *dc;</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// allocate GPU memory for A, B, and C in parallel</span></div>
<div class="line"> <a class="code hl_class" href="classtf_1_1Task.html">tf::Task</a> allocate_a = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&amp;]() {</div>
<div class="line"> cudaMalloc(&amp;da, M * K * <span class="keyword">sizeof</span>(<span class="keywordtype">int</span>));</div>
<div class="line"> }).name(<span class="stringliteral">"allocate_a"</span>);</div>
<div class="line"> </div>
<div class="line"> <a class="code hl_class" href="classtf_1_1Task.html">tf::Task</a> allocate_b = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&amp;]() {</div>
<div class="line"> cudaMalloc(&amp;db, K * N * <span class="keyword">sizeof</span>(<span class="keywordtype">int</span>));</div>
<div class="line"> }).name(<span class="stringliteral">"allocate_b"</span>);</div>
<div class="line"> </div>
<div class="line"> <a class="code hl_class" href="classtf_1_1Task.html">tf::Task</a> allocate_c = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&amp;]() {</div>
<div class="line"> cudaMalloc(&amp;dc, M * N * <span class="keyword">sizeof</span>(<span class="keywordtype">int</span>));</div>
<div class="line"> }).name(<span class="stringliteral">"allocate_c"</span>);</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// build and execute the CUDA graph</span></div>
<div class="line"> tf::Task cuda = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&amp;]() {</div>
<div class="line"> </div>
<div class="line"> <a class="code hl_typedef" href="namespacetf.html#a713c427e4f9841a90dec67045a3babed">tf::cudaGraph</a> cg;</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// H2D transfers for A and B</span></div>
<div class="line"> tf::cudaTask copy_da = cg.<a class="code hl_function" href="classtf_1_1cudaGraphBase.html#a02a041d5dd9e1e8958eb43e09331051e">copy</a>(da, A, M * K);</div>
<div class="line"> tf::cudaTask copy_db = cg.<a class="code hl_function" href="classtf_1_1cudaGraphBase.html#a02a041d5dd9e1e8958eb43e09331051e">copy</a>(db, B, K * N);</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// kernel: one thread per element of C</span></div>
<div class="line"> dim3 grid ((N + 15) / 16, (M + 15) / 16);</div>
<div class="line"> dim3 block(16, 16);</div>
<div class="line"> tf::cudaTask kmatmul = cg.<a class="code hl_function" href="classtf_1_1cudaGraphBase.html#a1473a15a6023fbc25e1f029f2ff84aec">kernel</a>(grid, block, 0,</div>
<div class="line"> matmul, da, db, dc, M, K, N</div>
<div class="line"> );</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// D2H transfer for C</span></div>
<div class="line"> tf::cudaTask copy_hc = cg.<a class="code hl_function" href="classtf_1_1cudaGraphBase.html#a02a041d5dd9e1e8958eb43e09331051e">copy</a>(C, dc, M * N);</div>
<div class="line"> </div>
<div class="line"> kmatmul.<a class="code hl_function" href="classtf_1_1cudaTask.html#a4a9ca1a34bac47e4c9b04eb4fb2f7775">succeed</a>(copy_da, copy_db)</div>
<div class="line"> .<a class="code hl_function" href="classtf_1_1cudaTask.html#abdd68287ec4dff4216af34d1db44d1b4">precede</a>(copy_hc);</div>
<div class="line"> </div>
<div class="line"> <a class="code hl_typedef" href="namespacetf.html#af19c9b301dc0b0fe2a51a960fa427e83">tf::cudaStream</a> stream;</div>
<div class="line"> <a class="code hl_typedef" href="namespacetf.html#a2be50e6880ead1d49a3fec2fc4bb893e">tf::cudaGraphExec</a> exec(cg);</div>
<div class="line"> stream.<a class="code hl_function" href="classtf_1_1cudaStreamBase.html#a7dcdfb79385a57c4c59b7c9f21e8beb9">run</a>(exec).<a class="code hl_function" href="classtf_1_1cudaStreamBase.html#a1e5140505629afd4b3422399f8080cb0">synchronize</a>();</div>
<div class="line"> </div>
<div class="line"> }).name(<span class="stringliteral">"cuda"</span>);</div>
<div class="line"> </div>
<div class="line"> <span class="comment">// free GPU memory</span></div>
<div class="line"> tf::Task free_mem = taskflow.<a class="code hl_function" href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">emplace</a>([&amp;]() {</div>
<div class="line"> cudaFree(da);</div>
<div class="line"> cudaFree(db);</div>
<div class="line"> cudaFree(dc);</div>
<div class="line"> }).name(<span class="stringliteral">"free"</span>);</div>
<div class="line"> </div>
<div class="line"> cuda.<a class="code hl_function" href="classtf_1_1Task.html#a331b1b726555072e7c7d10941257f664">succeed</a>(allocate_a, allocate_b, allocate_c)</div>
<div class="line"> .<a class="code hl_function" href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">precede</a>(free_mem);</div>
<div class="line"> </div>
<div class="line"> executor.<a class="code hl_function" href="classtf_1_1Executor.html#a519777f5783981d534e9e53b99712069">run</a>(taskflow).wait();</div>
<div class="line">}</div>
<div class="ttc" id="aclasstf_1_1Executor_html"><div class="ttname"><a href="classtf_1_1Executor.html">tf::Executor</a></div><div class="ttdoc">class to create an executor</div><div class="ttdef"><b>Definition</b> executor.hpp:62</div></div>
<div class="ttc" id="aclasstf_1_1Executor_html_a519777f5783981d534e9e53b99712069"><div class="ttname"><a href="classtf_1_1Executor.html#a519777f5783981d534e9e53b99712069">tf::Executor::run</a></div><div class="ttdeci">tf::Future&lt; void &gt; run(Taskflow &amp;taskflow)</div><div class="ttdoc">runs a taskflow once</div></div>
<div class="ttc" id="aclasstf_1_1FlowBuilder_html_a4d52a7fe2814b264846a2085e931652c"><div class="ttname"><a href="classtf_1_1FlowBuilder.html#a4d52a7fe2814b264846a2085e931652c">tf::FlowBuilder::emplace</a></div><div class="ttdeci">Task emplace(C &amp;&amp;callable)</div><div class="ttdoc">creates a static task</div><div class="ttdef"><b>Definition</b> flow_builder.hpp:1571</div></div>
<div class="ttc" id="aclasstf_1_1Task_html"><div class="ttname"><a href="classtf_1_1Task.html">tf::Task</a></div><div class="ttdoc">class to create a task handle over a taskflow node</div><div class="ttdef"><b>Definition</b> task.hpp:569</div></div>
<div class="ttc" id="aclasstf_1_1Task_html_a331b1b726555072e7c7d10941257f664"><div class="ttname"><a href="classtf_1_1Task.html#a331b1b726555072e7c7d10941257f664">tf::Task::succeed</a></div><div class="ttdeci">Task &amp; succeed(Ts &amp;&amp;... tasks)</div><div class="ttdoc">adds precedence links from other tasks to this</div><div class="ttdef"><b>Definition</b> task.hpp:1266</div></div>
<div class="ttc" id="aclasstf_1_1Task_html_a8c78c453295a553c1c016e4062da8588"><div class="ttname"><a href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">tf::Task::precede</a></div><div class="ttdeci">Task &amp; precede(Ts &amp;&amp;... tasks)</div><div class="ttdoc">adds precedence links from this to other tasks</div><div class="ttdef"><b>Definition</b> task.hpp:1258</div></div>
<div class="ttc" id="aclasstf_1_1Taskflow_html"><div class="ttname"><a href="classtf_1_1Taskflow.html">tf::Taskflow</a></div><div class="ttdoc">class to create a taskflow object</div><div class="ttdef"><b>Definition</b> taskflow.hpp:64</div></div>
<div class="ttc" id="aclasstf_1_1cudaGraphBase_html_a02a041d5dd9e1e8958eb43e09331051e"><div class="ttname"><a href="classtf_1_1cudaGraphBase.html#a02a041d5dd9e1e8958eb43e09331051e">tf::cudaGraphBase::copy</a></div><div class="ttdeci">cudaTask copy(T *tgt, const T *src, size_t num)</div><div class="ttdoc">creates a memcopy task that copies typed data</div><div class="ttdef"><b>Definition</b> cuda_graph.hpp:1075</div></div>
<div class="ttc" id="aclasstf_1_1cudaGraphBase_html_a1473a15a6023fbc25e1f029f2ff84aec"><div class="ttname"><a href="classtf_1_1cudaGraphBase.html#a1473a15a6023fbc25e1f029f2ff84aec">tf::cudaGraphBase::kernel</a></div><div class="ttdeci">cudaTask kernel(dim3 g, dim3 b, size_t s, F f, ArgsT... args)</div><div class="ttdoc">creates a kernel task</div><div class="ttdef"><b>Definition</b> cuda_graph.hpp:1010</div></div>
<div class="ttc" id="aclasstf_1_1cudaStreamBase_html_a1e5140505629afd4b3422399f8080cb0"><div class="ttname"><a href="classtf_1_1cudaStreamBase.html#a1e5140505629afd4b3422399f8080cb0">tf::cudaStreamBase::synchronize</a></div><div class="ttdeci">cudaStreamBase &amp; synchronize()</div><div class="ttdoc">synchronizes the associated stream</div><div class="ttdef"><b>Definition</b> cuda_stream.hpp:232</div></div>
<div class="ttc" id="aclasstf_1_1cudaStreamBase_html_a7dcdfb79385a57c4c59b7c9f21e8beb9"><div class="ttname"><a href="classtf_1_1cudaStreamBase.html#a7dcdfb79385a57c4c59b7c9f21e8beb9">tf::cudaStreamBase::run</a></div><div class="ttdeci">cudaStreamBase &amp; run(const cudaGraphExecBase&lt; C, D &gt; &amp;exec)</div><div class="ttdoc">runs the given executable CUDA graph</div></div>
<div class="ttc" id="aclasstf_1_1cudaTask_html_a4a9ca1a34bac47e4c9b04eb4fb2f7775"><div class="ttname"><a href="classtf_1_1cudaTask.html#a4a9ca1a34bac47e4c9b04eb4fb2f7775">tf::cudaTask::succeed</a></div><div class="ttdeci">cudaTask &amp; succeed(Ts &amp;&amp;... tasks)</div><div class="ttdoc">adds precedence links from other tasks to this</div><div class="ttdef"><b>Definition</b> cuda_graph.hpp:418</div></div>
<div class="ttc" id="aclasstf_1_1cudaTask_html_abdd68287ec4dff4216af34d1db44d1b4"><div class="ttname"><a href="classtf_1_1cudaTask.html#abdd68287ec4dff4216af34d1db44d1b4">tf::cudaTask::precede</a></div><div class="ttdeci">cudaTask &amp; precede(Ts &amp;&amp;... tasks)</div><div class="ttdoc">adds precedence links from this to other tasks</div><div class="ttdef"><b>Definition</b> cuda_graph.hpp:407</div></div>
<div class="ttc" id="anamespacetf_html_a2be50e6880ead1d49a3fec2fc4bb893e"><div class="ttname"><a href="namespacetf.html#a2be50e6880ead1d49a3fec2fc4bb893e">tf::cudaGraphExec</a></div><div class="ttdeci">cudaGraphExecBase&lt; cudaGraphExecCreator, cudaGraphExecDeleter &gt; cudaGraphExec</div><div class="ttdoc">default smart pointer type to manage a cudaGraphExec_t object with unique ownership</div><div class="ttdef"><b>Definition</b> cudaflow.hpp:23</div></div>
<div class="ttc" id="anamespacetf_html_a713c427e4f9841a90dec67045a3babed"><div class="ttname"><a href="namespacetf.html#a713c427e4f9841a90dec67045a3babed">tf::cudaGraph</a></div><div class="ttdeci">cudaGraphBase&lt; cudaGraphCreator, cudaGraphDeleter &gt; cudaGraph</div><div class="ttdoc">default smart pointer type to manage a cudaGraph_t object with unique ownership</div><div class="ttdef"><b>Definition</b> cudaflow.hpp:18</div></div>
<div class="ttc" id="anamespacetf_html_af19c9b301dc0b0fe2a51a960fa427e83"><div class="ttname"><a href="namespacetf.html#af19c9b301dc0b0fe2a51a960fa427e83">tf::cudaStream</a></div><div class="ttdeci">cudaStreamBase&lt; cudaStreamCreator, cudaStreamDeleter &gt; cudaStream</div><div class="ttdoc">default smart pointer type to manage a cudaStream_t object with unique ownership</div><div class="ttdef"><b>Definition</b> cuda_stream.hpp:340</div></div>
</div><!-- fragment --><p>The outer Taskflow manages CPU-side orchestration: the three allocation tasks run in parallel, then the CUDA graph task runs, and finally GPU memory is freed. Inside the CUDA graph, two H2D copy tasks feed the kernel and the kernel feeds the D2H copy task. The CPU taskflow graph is shown below:</p>
<div class="dotgraph">
<iframe scrolling="no" frameborder="0" src="dot_matrix_multiplication_5.svg" width="447" height="251"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe></div>
<p>After execution, the full task graph including the CUDA sub-graph can be visualised:</p>
<div class="dotgraph">
<iframe scrolling="no" frameborder="0" src="dot_matrix_multiplication_6.svg" width="692" height="498"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe></div>
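<p>The launch configuration in the listing above rounds the grid size up so that 16×16 blocks cover every element of <code>C</code>, even when <code>M</code> or <code>N</code> is not a multiple of 16. The host-side sketch below isolates that ceiling division; the names <code>ceil_div</code>, <code>LaunchShape</code>, and <code>launch_shape</code> are hypothetical and only mirror the arithmetic <code>(N + 15) / 16</code> and <code>(M + 15) / 16</code>:</p>
<div class="fragment"><div class="line">// Hypothetical helper: smallest number of chunks of size d that cover n items.</div>
<div class="line">inline unsigned ceil_div(unsigned n, unsigned d) {</div>
<div class="line">  return (n + d - 1) / d;</div>
<div class="line">}</div>
<div class="line"> </div>
<div class="line">// Hypothetical plain struct standing in for the (grid, block) pair of dim3 values.</div>
<div class="line">struct LaunchShape {</div>
<div class="line">  unsigned grid_x;   // blocks along the columns of C</div>
<div class="line">  unsigned grid_y;   // blocks along the rows of C</div>
<div class="line">  unsigned block_x;  // threads per block along x</div>
<div class="line">  unsigned block_y;  // threads per block along y</div>
<div class="line">};</div>
<div class="line"> </div>
<div class="line">// Grid and block shape for an M x N output matrix with tile x tile threads per block.</div>
<div class="line">inline LaunchShape launch_shape(unsigned M, unsigned N, unsigned tile = 16) {</div>
<div class="line">  return LaunchShape{ceil_div(N, tile), ceil_div(M, tile), tile, tile};</div>
<div class="line">}</div>
</div><!-- fragment --><p>For <code>M = N = 1000</code> this gives a 63×63 grid. Since 63 × 16 = 1008, eight threads per dimension fall outside the matrix, and the kernel's <code>row &lt; M &amp;&amp; col &lt; N</code> check keeps them from writing out of bounds.</p>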
<h1><a class="anchor" id="MatrixMultiplicationcudaFlowBenchmarking"></a>
Benchmarking</h1>
<p>We compare three versions (sequential CPU, parallel CPU, and GPU) on a machine with a 6-core (12-thread) Intel i7-8700 at 3.20 GHz and an NVIDIA RTX 2080:</p>
<div align="center"> <table class="markdownTable">
<tr class="markdownTableHead">
<th class="markdownTableHeadCenter">Matrix size </th><th class="markdownTableHeadCenter">CPU sequential </th><th class="markdownTableHeadCenter">CPU parallel </th><th class="markdownTableHeadCenter">GPU </th></tr>
<tr class="markdownTableRowOdd">
<td class="markdownTableBodyCenter">10×10 </td><td class="markdownTableBodyCenter">0.142 ms </td><td class="markdownTableBodyCenter">0.414 ms </td><td class="markdownTableBodyCenter">82 ms </td></tr>
<tr class="markdownTableRowEven">
<td class="markdownTableBodyCenter">100×100 </td><td class="markdownTableBodyCenter">1.641 ms </td><td class="markdownTableBodyCenter">0.733 ms </td><td class="markdownTableBodyCenter">83 ms </td></tr>
<tr class="markdownTableRowOdd">
<td class="markdownTableBodyCenter">1000×1000 </td><td class="markdownTableBodyCenter">1532 ms </td><td class="markdownTableBodyCenter">504 ms </td><td class="markdownTableBodyCenter">85 ms </td></tr>
<tr class="markdownTableRowEven">
<td class="markdownTableBodyCenter">2000×2000 </td><td class="markdownTableBodyCenter">25688 ms </td><td class="markdownTableBodyCenter">4387 ms </td><td class="markdownTableBodyCenter">133 ms </td></tr>
<tr class="markdownTableRowOdd">
<td class="markdownTableBodyCenter">3000×3000 </td><td class="markdownTableBodyCenter">104838 ms </td><td class="markdownTableBodyCenter">16170 ms </td><td class="markdownTableBodyCenter">214 ms </td></tr>
<tr class="markdownTableRowEven">
<td class="markdownTableBodyCenter">4000×4000 </td><td class="markdownTableBodyCenter">250133 ms </td><td class="markdownTableBodyCenter">39646 ms </td><td class="markdownTableBodyCenter">427 ms </td></tr>
</table>
</div><p>For small matrices the GPU's fixed overhead, which includes host-device data transfers, dominates: the GPU takes roughly 82 ms even at 10×10, so the CPU solutions are faster. As the problem size grows, the GPU's thread-level parallelism takes over. At 4000×4000, the GPU is about <b>586×</b> faster than the sequential CPU solution and about <b>93×</b> faster than the parallel CPU solution. </p>
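<p>The speedup figures follow directly from the table as the ratio of two runtimes. The short C++ sketch below reproduces that arithmetic from the timings listed above; the names <code>Row</code>, <code>benchmark_rows</code>, <code>speedup</code>, and <code>first_gpu_win</code> are hypothetical and exist only for this illustration:</p>
<div class="fragment"><div class="line">#include &lt;array&gt;</div>
<div class="line"> </div>
<div class="line">// One row of the benchmark table: matrix dimension n (for an n x n problem)</div>
<div class="line">// and the three measured runtimes in milliseconds.</div>
<div class="line">struct Row {</div>
<div class="line">  int n;</div>
<div class="line">  double cpu_seq_ms;</div>
<div class="line">  double cpu_par_ms;</div>
<div class="line">  double gpu_ms;</div>
<div class="line">};</div>
<div class="line"> </div>
<div class="line">// Timings copied from the table above.</div>
<div class="line">inline std::array&lt;Row, 6&gt; benchmark_rows() {</div>
<div class="line">  return {{</div>
<div class="line">    {10, 0.142, 0.414, 82.0},</div>
<div class="line">    {100, 1.641, 0.733, 83.0},</div>
<div class="line">    {1000, 1532.0, 504.0, 85.0},</div>
<div class="line">    {2000, 25688.0, 4387.0, 133.0},</div>
<div class="line">    {3000, 104838.0, 16170.0, 214.0},</div>
<div class="line">    {4000, 250133.0, 39646.0, 427.0},</div>
<div class="line">  }};</div>
<div class="line">}</div>
<div class="line"> </div>
<div class="line">// Speedup of the improved version over the baseline (values above 1 mean faster).</div>
<div class="line">inline double speedup(double baseline_ms, double improved_ms) {</div>
<div class="line">  return baseline_ms / improved_ms;</div>
<div class="line">}</div>
<div class="line"> </div>
<div class="line">// Smallest listed n at which the GPU beats the parallel CPU, or -1 if it never does.</div>
<div class="line">inline int first_gpu_win() {</div>
<div class="line">  for (const Row&amp; r : benchmark_rows()) {</div>
<div class="line">    if (r.gpu_ms &lt; r.cpu_par_ms) return r.n;</div>
<div class="line">  }</div>
<div class="line">  return -1;</div>
<div class="line">}</div>
</div><!-- fragment --><p>With these numbers, <code>speedup(250133, 427)</code> is approximately 585.8 and <code>speedup(39646, 427)</code> is approximately 92.8, while <code>first_gpu_win()</code> returns 1000: among the listed sizes, the GPU first overtakes the parallel CPU solution at 1000×1000.</p>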
</div></div><!-- contents -->
</div><!-- PageDoc -->
</div><!-- doc-content -->
<!-- HTML footer for doxygen 1.13.1-->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="navelem"><a class="el" href="Examples.html">Learning from Examples</a></li>
<li class="footer">
Maintained by <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>
—
Generated by <a href="https://www.doxygen.org/index.html"><img class="footer" src="doxygen.svg" width="104" height="31" alt="doxygen"/></a> 1.13.1
</li>
</ul>
</div>