The pSTL_offload sample demonstrates the offloading of C++ standard parallel algorithms to a SYCL device.
| Area | Description |
|---|---|
| What you will learn | Offloading of C++ standard algorithms to GPU devices. |
| Time to complete | 15 minutes |
| Category | Concepts and Functionality |
Note: This sample is based on the cppParallelSTL GitHub repository.
With the -fsycl-pstl-offload compiler option of the Intel® oneAPI DPC++/C++ Compiler, C++ standard parallel STL code (par_unseq policy) can be offloaded to GPU and CPU devices without any code changes. It is an experimental feature of oneDPL.
This folder contains three samples in the following folders:
| Folder Name | Description |
|---|---|
| FileWordCount | Counting Words in Files Example |
| WordCount | Counting Words generated Example |
| ParSTLTests | Examples of Various STL Algorithms with Execution Policies |
Note: For more information refer to Get Started with Parallel STL.
| Optimized for | Description |
|---|---|
| OS | Ubuntu* 22.04 |
| Hardware | Intel® Data Center GPU Max, Intel® Xeon CPU |
| Software | Intel® oneAPI Base Toolkit version 2024.2, Intel® Threading Building Blocks (Intel® TBB) |
The example includes three samples: FileWordCount, WordCount, and ParSTLTests. FileWordCount counts the words in files and WordCount counts the words in generated text; both use the standard C++17 parallel algorithm transform_reduce. ParSTLTests demonstrates the use of various STL algorithms with different execution policies (seq, par, par_unseq). It applies these algorithms to large datasets and prints the results for each execution. This computation can be offloaded to the GPU device with the -fsycl-pstl-offload compiler option. Standard headers must be included explicitly for PSTL Offload to work.
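The word-counting idea can be sketched with transform_reduce as follows. This is a minimal illustration, not the sample's actual source: the names is_space, to_buffer, and count_words are hypothetical, whitespace is limited to four ASCII characters, and the text is kept in a std::vector<char> on the assumption that heap-allocated containers are what an offloaded algorithm can reach.

```cpp
#include <cassert>
#include <cstddef>
#include <execution>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: treats only these four ASCII characters as whitespace.
inline bool is_space(char c) {
    return c == ' ' || c == '\n' || c == '\t' || c == '\r';
}

// Hypothetical helper: copies a string into a heap-backed character buffer.
inline std::vector<char> to_buffer(const std::string& s) {
    return std::vector<char>(s.begin(), s.end());
}

// Counts words by counting every position where a non-space character
// follows a space, plus one if the text starts with a non-space character.
inline std::size_t count_words(const std::vector<char>& text) {
    if (text.empty()) return 0;
    const std::size_t leading = is_space(text.front()) ? 0 : 1;
    return std::transform_reduce(
        std::execution::par_unseq,
        text.begin(), text.end() - 1,  // "previous" characters
        text.begin() + 1,              // "current" characters
        leading,                       // initial value of the sum
        std::plus<>{},                 // reduction step
        [](char prev, char cur) -> std::size_t {
            return (is_space(prev) && !is_space(cur)) ? 1 : 0;
        });
}
```

The two-range overload pairs each character with its successor, so no index arithmetic is needed inside the lambda and the same call works unchanged on the host.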
The FileWordCount sample also demonstrates the use of the transform, copy, copy_if, and for_each standard C++17 parallel algorithms. The ParSTLTests sample uses STL algorithms such as reduce, accumulate, find, copy_if, inclusive_scan, min_element, max_element, minmax_element, is_partitioned, lexicographical_compare, binary_search, lower_bound, and upper_bound. These algorithms perform tasks like summing elements, finding values, copying based on conditions, scanning, and searching within large datasets.
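To illustrate running the same algorithm under different execution policies, here is a small sketch in the spirit of ParSTLTests. It is not taken from the sample; sum_with, evens_with, min_max_with, and make_sequence are hypothetical names, and the dataset is a simple 1..n sequence chosen so the results are easy to verify.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Sums the elements under the given execution policy.
template <class Policy>
long long sum_with(Policy&& policy, const std::vector<int>& data) {
    return std::reduce(std::forward<Policy>(policy),
                       data.begin(), data.end(), 0LL);
}

// Copies the even elements under the given execution policy.
template <class Policy>
std::vector<int> evens_with(Policy&& policy, const std::vector<int>& data) {
    std::vector<int> out(data.size());
    auto last = std::copy_if(std::forward<Policy>(policy),
                             data.begin(), data.end(), out.begin(),
                             [](int x) { return x % 2 == 0; });
    out.erase(last, out.end());  // drop the unused tail
    return out;
}

// Returns {min, max} under the given policy; data must be non-empty.
template <class Policy>
std::pair<int, int> min_max_with(Policy&& policy, const std::vector<int>& data) {
    assert(!data.empty());
    auto [lo, hi] = std::minmax_element(std::forward<Policy>(policy),
                                        data.begin(), data.end());
    return {*lo, *hi};
}

// Hypothetical helper: builds the sequence 1, 2, ..., n.
inline std::vector<int> make_sequence(int n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 1);
    return v;
}
```

Calling, for example, sum_with(std::execution::seq, v), sum_with(std::execution::par, v), and sum_with(std::execution::par_unseq, v) on the same vector should give the same total; only the permitted execution strategy differs.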
The -fsycl-pstl-offload option enables offloading of C++ standard parallel algorithms to a SYCL device. Only algorithms called with the std::execution::par_unseq policy are offloaded. The offloaded algorithms are implemented via the oneAPI DPC++ Library (oneDPL). This option is an experimental feature. The option takes an optional argument that selects the target device; if the argument is not specified, the compiler offloads to the default SYCL device.
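The distinction between policies can be made concrete with two otherwise identical functions. This is a sketch under stated assumptions: scale_offloadable and scale_on_host are hypothetical names, and the comments about where each call runs apply only when the code is built with -fsycl-pstl-offload. Without the option, both run on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Uses par_unseq: when built with -fsycl-pstl-offload, this call is
// the kind that is eligible for offload to the SYCL device.
inline std::vector<float> scale_offloadable(const std::vector<float>& in,
                                            float factor) {
    std::vector<float> out(in.size());
    std::transform(std::execution::par_unseq, in.begin(), in.end(),
                   out.begin(), [factor](float x) { return x * factor; });
    return out;
}

// Uses par: same computation, but this policy is not offloaded,
// so the call stays on the host regardless of the option.
inline std::vector<float> scale_on_host(const std::vector<float>& in,
                                        float factor) {
    std::vector<float> out(in.size());
    std::transform(std::execution::par, in.begin(), in.end(),
                   out.begin(), [factor](float x) { return x * factor; });
    return out;
}
```

Both functions return the same values; the source code does not change between host and device runs, only the policy and the compiler option decide where the work is done.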
The performance of memory allocations may be improved by using the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable.
When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.
Note: If you have not already done so, set up your CLI environment by sourcing the setvars script at the root of your oneAPI installation.

Linux*:
- For system wide installations: `. /opt/intel/oneapi/setvars.sh`
- For private installations: `. ~/intel/oneapi/setvars.sh`
- For non-POSIX shells, like csh, use the following command: `bash -c 'source <install-dir>/setvars.sh ; exec csh'`

Windows*:
- `C:\Program Files (x86)\Intel\oneAPI\setvars.bat`
- Windows PowerShell*, use the following command: `cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'`

For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS*.
- Change to the sample directory.

- Build the program.

  ```
  $ mkdir build
  $ cd build
  $ cmake -D GPU=1 ..    # or: cmake -D CPU=1 ..
  $ make
  ```

  Note: Enable the GPU flag during the build to execute on GPUs; this supports Intel® Data Center GPU Max 1550 or 1100. Enable the CPU flag during the build to execute on CPU.

  This command sequence will build the WordCount and FileWordCount samples.

- Run the program.

  Run pSTL_offload-WordCount on GPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
  $ make run_wc
  $ unset ONEAPI_DEVICE_SELECTOR
  ```

  Run pSTL_offload-WordCount on CPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=*:cpu
  $ make run_wc
  $ unset ONEAPI_DEVICE_SELECTOR
  ```

  Run pSTL_offload-FileWordCount on GPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
  $ make run_fwc0   # for SEQ Policy
  $ make run_fwc1   # for PAR Policy
  $ unset ONEAPI_DEVICE_SELECTOR
  ```

  Run pSTL_offload-FileWordCount on CPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=*:cpu
  $ make run_fwc0   # for SEQ Policy
  $ make run_fwc1   # for PAR Policy
  $ unset ONEAPI_DEVICE_SELECTOR
  ```

  Run pSTL_offload-ParSTLTest on GPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
  $ ./ParSTLTest
  $ unset ONEAPI_DEVICE_SELECTOR
  ```

  Run pSTL_offload-ParSTLTest on CPU.

  ```
  $ export ONEAPI_DEVICE_SELECTOR=*:cpu
  $ ./ParSTLTest
  $ unset ONEAPI_DEVICE_SELECTOR
  ```
If an error occurs, you can get more details by running make with the VERBOSE=1 argument:
$ make VERBOSE=1
If you receive an error message, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the Diagnostics Utility for Intel® oneAPI Toolkits User Guide for more information on using the utility.
Code samples are licensed under the MIT license. See License.txt for details.
Third party program licenses are at third-party-programs.txt.