<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<title></title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<script src="libs/jqueryui-1.11.4/jquery-ui.min.js"></script>
<link href="libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" />
<script src="libs/tocify-1.9.1/jquery.tocify.js"></script>
<link href="libs/bootstrap-3.3.5/css/yeti.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/textmate.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<link rel="stylesheet" href="styles.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
</style>
<div class="container-fluid main-container">
<script>
$(function() {
// establish options
var options = {
selectors: "h1,h2,h3",
theme: "bootstrap3",
context: '.toc-content',
hashGenerator: function (text) {
return text.replace(/[.\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase();
},
ignoreSelector: "h1.title",
scrollTo: 0
};
options.showAndHide = false;
options.smoothScroll = true;
// tocify
var toc = $("#TOC").tocify(options).data("toc-tocify");
});
</script>
<style type="text/css">
#TOC {
margin: 25px 0px 20px 0px;
}
@media (max-width: 768px) {
#TOC {
position: relative;
width: 100%;
}
}
.toc-content {
padding-left: 30px;
padding-right: 40px;
}
div.main-container {
max-width: 1200px;
}
div.tocify {
width: 20%;
max-width: 260px;
max-height: 85%;
}
@media (min-width: 768px) and (max-width: 991px) {
div.tocify {
width: 25%;
}
}
.tocify ul, .tocify li {
line-height: 20px;
}
.tocify-subheader .tocify-item {
font-size: 0.9em;
padding-left: 5px;
}
.tocify .list-group-item {
border-radius: 0px;
}
.tocify-subheader {
display: inline;
}
.tocify-subheader .tocify-item {
font-size: 0.95em;
padding-left: 10px;
}
</style>
<!-- setup 3col/9col grid for toc_float and main content -->
<div class="row-fluid">
<div class="col-sm-4 col-md-3">
<div id="TOC" class="tocify">
</div>
</div>
<div class="toc-content col-sm-8 col-md-9">
<div class="navbar navbar-default navbar-inverse navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">Rcpp Parallel</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li><a href="/">Home</a></li>
<li><a href="/tbb.html">Intel TBB</a></li>
<li><a href="/simd.html">Boost.SIMD</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li><a href="https://github.com/RcppCore/RcppParallel">GitHub</a></li>
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
<script>
// manage active state of menu based on current page
$(document).ready(function () {
// active menu
var href = window.location.pathname;
href = href.substr(href.lastIndexOf('/'));
$('a[href="' + href + '"]').parent().addClass('active');
});
</script>
<h1 class="title">
<img id="logo" src="images/RcppParallelLogo.png" width="643" height="90" />
</h1>
<div id="overview" class="section level2">
<h2>Overview</h2>
<p>RcppParallel provides a complete toolkit for creating portable, high-performance parallel algorithms without requiring direct manipulation of operating system threads. RcppParallel includes:</p>
<ul>
<li><p><a href="https://www.threadingbuildingblocks.org/">Intel TBB</a> (v4.3), a C++ library for task parallelism with a wide variety of parallel algorithms and data structures (Windows, OS X, Linux, and Solaris x86 only).</p></li>
<li><p><a href="http://nt2.metascale.fr/doc/html/boost_simd.html">Boost.SIMD</a>, a C++ template library that provides portable (vis-à-vis instruction sets and compilers) access to SIMD extensions.</p></li>
<li><p><a href="http://tinythreadpp.bitsnbites.eu/">TinyThread</a>, a C++ library for portable use of operating system threads.</p></li>
<li><p><code>RVector</code> and <code>RMatrix</code> wrapper classes for safe and convenient access to R data structures in a multi-threaded environment.</p></li>
<li><p>High level parallel functions (<code>parallelFor</code> and <code>parallelReduce</code>) that use Intel TBB as a back-end on systems that support it and TinyThread on other platforms.</p></li>
</ul>
</div>
<div id="examples" class="section level2">
<h2>Examples</h2>
<p>Below are some simple examples of RcppParallel in use, along with the performance increases achieved over serial code. The benchmarks were executed on a 2.6GHz Haswell MacBook Pro with 4 cores (8 with hyperthreading).</p>
<p><a href="http://gallery.rcpp.org/articles/parallel-matrix-transform/">Parallel Matrix Transform</a> — Demonstrates using <code>parallelFor</code> to transform a matrix (take the square root of each element) in parallel. In this example the parallel version performs about 2.5x faster than the serial version.</p>
<p><a href="http://gallery.rcpp.org/articles/parallel-vector-sum/">Parallel Vector Sum</a> — Demonstrates using <code>parallelReduce</code> to take the sum of a vector in parallel. In this example the parallel version performs 4.5x faster than the serial version.</p>
<p><a href="http://gallery.rcpp.org/articles/parallel-distance-matrix/">Parallel Distance Matrix</a> — Demonstrates using <code>parallelFor</code> to compute pairwise distances for each row in an input data matrix. In this example the parallel version performs 5.5x faster than the serial version.</p>
<p><a href="http://gallery.rcpp.org/articles/parallel-inner-product/">Parallel Inner Product</a> — Demonstrates using <code>parallelReduce</code> to compute the inner product of two vectors in parallel. In this example the parallel version performs 2.5x faster than the serial version.</p>
<p>Studying the examples is a good way to get the hang of RcppParallel; however, you should still review this guide in detail, as it includes important documentation on thread safety, tuning, and using Intel TBB directly for more advanced use cases.</p>
</div>
<div id="getting-started" class="section level2">
<h2>Getting Started</h2>
<p>You can install the RcppParallel package from CRAN as follows:</p>
<pre class="r"><code>install.packages("RcppParallel")</code></pre>
<div id="sourcecpp" class="section level3">
<h3>sourceCpp</h3>
<p>Add the following to a standalone C++ source file to import RcppParallel:</p>
<pre class="cpp"><code>// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h></code></pre>
<p>When you compile the file using <code>Rcpp::sourceCpp</code> the required compiler and linker settings for RcppParallel will be automatically included in the compilation.</p>
</div>
<div id="r-packages" class="section level3">
<h3>R Packages</h3>
<p>If you want to use RcppParallel from within an R package you need to edit several files to create the requisite build and runtime links. The following additions should be made:</p>
<p><strong>DESCRIPTION</strong></p>
<pre class="yaml"><code>Imports: RcppParallel
LinkingTo: RcppParallel
SystemRequirements: GNU make</code></pre>
<p><strong>NAMESPACE</strong></p>
<pre class="r"><code>importFrom(RcppParallel, RcppParallelLibs)</code></pre>
<p><strong>src/Makevars</strong></p>
<pre class="make"><code>PKG_LIBS += $(shell ${R_HOME}/bin/Rscript -e "RcppParallel::RcppParallelLibs()")</code></pre>
<p><strong>src/Makevars.win</strong></p>
<pre class="make"><code>PKG_CXXFLAGS += -DRCPP_PARALLEL_USE_TBB=1
PKG_LIBS += $(shell "${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" \
-e "RcppParallel::RcppParallelLibs()")</code></pre>
<p>Note that the Windows variation (Makevars.win) requires an extra <code>PKG_CXXFLAGS</code> entry that enables the use of TBB. This is because TBB is not used by default on Windows (for backward compatibility with a previous version of RcppParallel which lacked support for TBB on Windows).</p>
<p>After you’ve added the above to the package you can simply include the main RcppParallel package header in source files that need to use it:</p>
<pre class="cpp"><code>#include <RcppParallel.h></code></pre>
</div>
</div>
<div id="thread-safety" class="section level2">
<h2>Thread Safety</h2>
<p>A major goal of RcppParallel is to make it possible to write parallel code without traditional threading and locking primitives (which are notoriously complicated and difficult to get right). This is achieved for the most part by <code>parallelFor</code> and <code>parallelReduce</code>; however, the fact that the R API itself is single-threaded must also be taken into consideration.</p>
<div id="api-restrictions" class="section level3">
<h3>API Restrictions</h3>
<p>The code that you write within parallel workers should not call the R or Rcpp API in any fashion. This is because R is single-threaded, and concurrent interaction with its data structures can cause crashes and other undefined behavior. Here is the official guidance from <a href="https://cran.rstudio.com/doc/manuals/r-release/R-exts.html">Writing R Extensions</a>:</p>
<blockquote>
<p>Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code.</p>
</blockquote>
<p>Not being able to call the R or Rcpp API creates an obvious challenge: how to read and write R vectors and matrices. Fortunately, R vectors and matrices are just contiguous arrays of <code>int</code>, <code>double</code>, etc., and so can be accessed using traditional array and pointer offsets. The next section describes a safe and high-level way to do this.</p>
</div>
<div id="safe-accessors" class="section level3">
<h3>Safe Accessors</h3>
<p>To provide safe and convenient access to the arrays underlying R vectors and matrices, RcppParallel introduces several accessor classes:</p>
<ul>
<li><p><code>RVector<T></code> — Wrap R vectors of various types</p></li>
<li><p><code>RMatrix<T></code> — Wrap R matrices of various types (also includes <code>Row</code> and <code>Column</code> classes)</p></li>
</ul>
<p>To create a thread safe accessor for an Rcpp vector or matrix just construct an instance of <code>RVector</code> or <code>RMatrix</code> with it. For example:</p>
<pre class="cpp"><code>// [[Rcpp::export]]
IntegerVector transformVector(IntegerVector x) {
RVector<int> input(x);
// etc...
}</code></pre>
<p>Similarly, if you need to return a vector as a result of a parallel transformation you should first create it using Rcpp then construct a wrapper for writing from multiple threads. For example:</p>
<pre class="cpp"><code>// [[Rcpp::export]]
IntegerVector transformVector(IntegerVector x) {
RVector<int> input(x); // create threadsafe wrapper to input
IntegerVector y(x.size()); // allocate output vector
RVector<int> output(y); // create threadsafe wrapper to output
// ...transform vector in parallel ...
return y;
}</code></pre>
</div>
<div id="locking" class="section level3">
<h3>Locking</h3>
<p>When using RcppParallel you typically do not need to worry about explicit locking, as the mechanics of <code>parallelFor</code> and <code>parallelReduce</code> (explained below) take care of providing safe windows into input and output data that have no possibility of contention. Nevertheless, if for some reason you do need to synchronize access to shared data, you can use the TinyThread locking classes (automatically available via <code>RcppParallel.h</code>):</p>
<table style="width:114%;">
<colgroup>
<col width="29%" />
<col width="84%" />
</colgroup>
<thead>
<tr class="header">
<th align="left">Function</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left"><a href="http://tinythreadpp.bitsnbites.eu/doc/classtthread_1_1lock__guard.html"><code>lock_guard</code></a></td>
<td align="left">Lock guard class. The constructor locks the mutex, and the destructor unlocks the mutex, so the mutex will automatically be unlocked when the lock guard goes out of scope.</td>
</tr>
<tr class="even">
<td align="left"><a href="http://tinythreadpp.bitsnbites.eu/doc/classtthread_1_1mutex.html"><code>mutex</code></a></td>
<td align="left">Mutual exclusion object for synchronizing access to shared memory areas for several threads. The mutex is non-recursive (i.e. a program may deadlock if the thread that owns a mutex object calls lock() on that object).</td>
</tr>
<tr class="odd">
<td align="left"><a href="http://tinythreadpp.bitsnbites.eu/doc/classtthread_1_1recursive__mutex.html"><code>recursive_mutex</code></a></td>
<td align="left">Mutual exclusion object for synchronizing access to shared memory areas for several threads. The mutex is recursive (i.e. a thread may lock the mutex several times, as long as it unlocks the mutex the same number of times).</td>
</tr>
<tr class="even">
<td align="left"><a href="http://tinythreadpp.bitsnbites.eu/doc/classtthread_1_1fast__mutex.html"><code>fast_mutex</code></a></td>
<td align="left">Mutual exclusion object for synchronizing access to shared memory areas for several threads. It is similar to the tthread::mutex class, but instead of using system level functions, it is implemented as an atomic spin lock with very low CPU overhead.</td>
</tr>
</tbody>
</table>
<p>See the complete <a href="http://tinythreadpp.bitsnbites.eu/doc/">TinyThread documentation</a> for additional details.</p>
<p>The TinyThread locking primitives will work on all platforms. If you are using TBB directly you can alternatively use the synchronization classes provided by TBB. See the section on <a href="tbb.html#synchronization">TBB Synchronization</a> for additional details.</p>
</div>
</div>
<div id="algorithms" class="section level2">
<h2>Algorithms</h2>
<p>RcppParallel provides two high level parallel algorithms: <code>parallelFor</code> can be used to convert the work of a standard serial “for” loop into a parallel one and <code>parallelReduce</code> can be used for accumulating aggregate or other values.</p>
<div id="parallelfor" class="section level3">
<h3>parallelFor</h3>
<p>To use <code>parallelFor</code>, you create a <code>Worker</code> object that defines an <code>operator()</code> which is called by the parallel scheduler. This function is passed a <code>[begin,end)</code> exclusive range which is a safe window (i.e. not in use by other threads) into the input or output data. Note that the <code>end</code> element is not included in the range (just like an STL <code>end</code> iterator).</p>
<p>For example, here’s a <code>Worker</code> object that takes the square root of its input and writes it into its output:</p>
<pre class="cpp"><code>// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
using namespace RcppParallel;
struct SquareRoot : public Worker
{
// source matrix
const RMatrix<double> input;
// destination matrix
RMatrix<double> output;
// initialize with source and destination
SquareRoot(const NumericMatrix input, NumericMatrix output)
: input(input), output(output) {}
// take the square root of the range of elements requested
void operator()(std::size_t begin, std::size_t end) {
std::transform(input.begin() + begin,
input.begin() + end,
output.begin() + begin,
::sqrt);
}
};
</code></pre>
<p>Note that <code>SquareRoot</code> derives from <code>RcppParallel::Worker</code>. This is required for function objects passed to <code>parallelFor</code>.</p>
<p>Here’s a function that calls the <code>SquareRoot</code> worker we defined:</p>
<pre class="cpp"><code>// [[Rcpp::export]]
NumericMatrix parallelMatrixSqrt(NumericMatrix x) {
// allocate the output matrix
NumericMatrix output(x.nrow(), x.ncol());
// SquareRoot functor (pass input and output matrices)
SquareRoot squareRoot(x, output);
// call parallelFor to do the work
parallelFor(0, x.length(), squareRoot);
// return the output matrix
return output;
}</code></pre>
</div>
<div id="parallelreduce" class="section level3">
<h3>parallelReduce</h3>
<p>To use <code>parallelReduce</code> you also create a <code>Worker</code> object; it should include:</p>
<ol style="list-style-type: decimal">
<li><p>A standard and “splitting” constructor. The standard constructor takes the input data and initializes whatever value is being accumulated (e.g. initialize a sum to zero). The splitting constructor is called when work needs to be split onto other threads—it takes a reference to the instance it is being split from and simply copies the pointer to the input data and initializes its “accumulated” value to zero.</p></li>
<li><p>An operator() which performs the work. This works just like the operator() in <code>parallelFor</code>, but instead of writing to another vector or matrix it typically will accumulate a value.</p></li>
<li><p>A <code>join</code> method which composes the operations of two worker instances that were previously split. Here we simply add the accumulated value of the instance being joined to our own.</p></li>
</ol>
<p>For example, here’s a <code>Worker</code> object that is used to sum a vector:</p>
<pre class="cpp"><code>// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
using namespace RcppParallel;
struct Sum : public Worker
{
// source vector
const RVector<double> input;
// accumulated value
double value;
// constructors
Sum(const NumericVector input) : input(input), value(0) {}
Sum(const Sum& sum, Split) : input(sum.input), value(0) {}
// accumulate just the element of the range I've been asked to
void operator()(std::size_t begin, std::size_t end) {
value += std::accumulate(input.begin() + begin, input.begin() + end, 0.0);
}
// join my value with that of another Sum
void join(const Sum& rhs) {
value += rhs.value;
}
};</code></pre>
<p>Now that we’ve defined the Worker, implementing the parallel sum function is straightforward. Just initialize an instance of <code>Sum</code> with an input vector and call <code>parallelReduce</code>:</p>
<pre class="cpp"><code>// [[Rcpp::export]]
double parallelVectorSum(NumericVector x) {
// declare the Sum instance
Sum sum(x);
// call parallel_reduce to start the work
parallelReduce(0, x.length(), sum);
// return the computed sum
return sum.value;
}</code></pre>
</div>
<div id="tbb-algorithms" class="section level3">
<h3>TBB Algorithms</h3>
<p>RcppParallel provides the <code>parallelFor</code> and <code>parallelReduce</code> algorithms; however, the TBB library includes a wealth of more advanced algorithms and other tools for parallelization. See the <a href="tbb.html">Intel TBB</a> article for additional details.</p>
</div>
</div>
<div id="tuning" class="section level2">
<h2>Tuning</h2>
<p>There are several settings available for tuning the behavior of parallel algorithms. These settings as well as benchmarking techniques are covered below.</p>
<div id="grain-size" class="section level3">
<h3>Grain Size</h3>
<p>The grain size of a parallel algorithm sets a minimum chunk size for parallelization. In other words, it determines the point at which input stops being split across separate threads (creating too many threads can degrade an algorithm’s performance by introducing excessive synchronization overhead).</p>
<p>By default the grain size for TBB (and thus for <code>parallelFor</code> and <code>parallelReduce</code>) is 1. You can change the grain size by passing an additional parameter to these functions. For example:</p>
<pre class="cpp"><code>parallelReduce(0, x.length(), sum, 100);</code></pre>
<p>This prevents ranges of fewer than 100 items from being split onto separate threads. You should experiment with various chunk sizes and use the benchmarking tools described below to measure their effectiveness. The Intel TBB website includes a detailed <a href="https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Controlling_Chunking.htm">discussion of grain sizes and partitioning</a> with useful guidelines for tweaking grain sizes.</p>
</div>
<div id="threads-used" class="section level3">
<h3>Threads Used</h3>
<p>By default all of the available cores on a machine are used for parallel algorithms. You may instead want to use a fixed number of threads or a fixed proportion of cores available on the machine.</p>
<p>R rather than C++ functions are provided to control these settings so that users of your algorithm can control the use of resources on their system. You can call the <code>setThreadOptions</code> function to allocate threads. For example, the following sets a maximum of 4 threads:</p>
<pre class="r"><code>RcppParallel::setThreadOptions(numThreads = 4)</code></pre>
<p>To use a proportion of available cores you can use the <code>defaultNumThreads</code> function. For example, the following says to use half of the available cores on a system:</p>
<pre class="r"><code>library(RcppParallel)
setThreadOptions(numThreads = defaultNumThreads() / 2)</code></pre>
</div>
<div id="benchmarking" class="section level3">
<h3>Benchmarking</h3>
<p>As you experiment with various settings to tune your parallel algorithms you should always measure the results. The <strong>rbenchmark</strong> package has some useful tools for doing this. For example, here’s a benchmark of the parallel matrix square root example from above (in this case it’s a comparison against the serial version):</p>
<pre class="r"><code># allocate a matrix
m <- matrix(as.numeric(c(1:1000000)), nrow = 1000, ncol = 1000)
# ensure that serial and parallel versions give the same result
stopifnot(identical(matrixSqrt(m), parallelMatrixSqrt(m)))
# compare performance of serial and parallel
library(rbenchmark)
res <- benchmark(matrixSqrt(m),
parallelMatrixSqrt(m),
order="relative")
res[,1:4]</code></pre>
<pre><code> test replications elapsed relative
2 parallelMatrixSqrt(m) 100 0.294 1.000
1 matrixSqrt(m) 100 0.755 2.568</code></pre>
</div>
</div>
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>